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Abstract. We introduce a new estimator SMUCE (simultaneous multiscalc change- 
point estimator) for the change-point problem in exponential family regression. An un- 
known step function is estimated by minimizing the number of change-points over the 
acceptance region of a multiscale test at a level a. 

The probability of overestimating the true number of change-points K is controlled by 
the asymptotic null distribution of the multiscale test statistic. Further, we derive expo- 
nential bounds for the probability of underestimating K. By balancing these quantities, 
a will be chosen such that the probability of correctly estimating K is maximized. All 
results are even non-asymptotic for the normal case. 

Based on the aforementioned bounds, we construct asymptotically honest confidence 
sets for the unknown step function and its change-points. At the same time, we obtain 
exponential bounds for estimating the change-point locations which for example yield the 
minimax rate (^(n^ 1 ) up to a log term. Finally, SMUCE achieves the optimal detection 
rate of vanishing signals as n — > oo. 

We illustrate how dynamic programming techniques can be employed for efficient com- 
putation of estimators and confidence regions. The performance of the proposed multiscale 
approach is illustrated by simulations and in two cutting-edge applications from genetic 
engineering and photoemission spectroscopy. 



I. Introduction 

Assume that we observe independent random variables Y = (Yi, . . . , Y n ) through the ex- 
ponential family regression model 

Y ~ ifyi/n), for % = 1, . . . , n, (1) 
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where {Fg} e&e is a one dimensional exponential family with densities fg and $ : [0, 1) — > 
C R a right-continuous step function with an unknown number K of change-points. The 
two upper panels in Figure [T] depict such a step function with K = 8 change-points and 
corresponding data Y for the Gaussian family Fg = Af(6, a 2 ) with fixed variance a 2 . 
The change-point problem consists in estimating 

(i) the number of change-points of 

(ii) the change-point locations and the function values (intensities) of 
Additionally, we address the more involved issue of constructing 

(iii) confidence bands for the function $ and simultaneous confidence intervals for its 
change-point locations. 



1.1. Multiscale statistics and estimation. The goals (i) - (iii) will be achieved based 
on a new estimation and inference method for the change-point problem in exponential 
families: the Simultaneous MUltiscale Change-point Estimator (SMUCE). Let S denote 
the space of all step functions on the unit interval [0,1) with values in 0. For $ G S 
we denote by J(^) the ordered vector of change-points and by #J(^) its length, i.e. the 
number of change-points. In a first step, we propose to solve the (nonconvex) optimization 
problem 

inf#J(tf) s.t. T n (Y,$)<q, (2) 

where q is a threshold to be specified later. T n (Y, $) is a certain multiscale statistic for 
a candidate function $ G S. Optimization problems of the type ^ have been recently 
considered in [S] for Gaussian change-point regression (see also [9] for a related approach) 
and for volatility estimation in [21]. T n in ^ evaluates the maximum over the local 
likelihood ratio statistics on all discrete intervals [i/n,j/n] such that $ is constant on these 
with value 9 = 9 it j, i.e. 

T n {Y^)= max U !!*{¥,&)- J 2\ag 6 " V (3) 

l<Kj<n V V W ? — ?. + 1 / 

i?(i)=0 for te[i/n,j/n] v 

where e = exp(l) and log denotes the natural logarithm. The local likelihood ratio statistic 
T- for testing H : 9 = 9 against Hi : 9 ^ 9 on the interval [i/n,j/n] is defined as 

7?(V-,« = log(%^^y (4) 

It measures how well the data can be described locally by a constant value #o on the 
interval [i/n,j/n\. We stress that the multiscale statistic T n does not act on all intervals 
[i/n,j/n] C [0, 1] but only on those which the candidate function $ is constant on, see also 
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[2"Ij 0U EB] • Thus the system of intervals appearing in ^ makes up the specific multiscale 
nature of T n . The log-expression in (|3| can be seen as a scale calibrating term that puts 
different scales on equal footing. As argued in [32J and [32] this improves the power of the 
multiscale test over the majority of scales. Roughly speaking, from a multiscale point of 
view, scale-calibration becomes advantageous, since there are many more small intervals 
than large ones. 

SMUCE integrates the multiscale test on the r.h.s. in ^ into two simultaneous estimation 
steps: Model selection (estimation of K) and estimation of $ given K. The minimal value 
of #J in ([2]) gives the estimated number of change-points, denoted by K(q). To obtain 
the final estimator for § first consider the set of all solutions of (|2]) given by 

C(q) = |$ G iS : #J(tf) = K(q) and T n (Y, 0) < g} , (5) 

which constitutes a confidence set for the true regression function •& as we will discuss later 
on. Then, the SMUCE d{q) is defined to be the constrained maximum likelihood estimator 
within this confidence set C(q), i.e. 

n 

Hq) = argmax V) log (/*(i/„) (Yi)) • ( 6 ) 

The lower panel in Figure [I] shows an example of a SMUCE (red solid line) for Gaussian 
observations. As stressed above, the multiscale constraint on the r.h.s. of (|2| renders the 
SMUCE sensitive to the multiscale nature of the signal The signal in Figure [T] is a case 
in point: It exhibits large and small scales simultaneously and remarkably the SMUCE 
i?(g) recovers them both equally well. 

1.2. Deviation bounds and confidence sets. The parameter q £ R in (|2j plays a crucial 
role because it governs the trade-off between data-fit (the r.h.s. in ([2])) and parsimony (the 
l.h.s. in (I2J). It has an immediate statistical interpretation. From ^ it follows that 

P (K{q) >K)< P(T„,(y,tf) > q). (7) 

Hence, by choosing q = g 1 _ Q to be the 1 — a-quantile of the (asymptotic) null distribution 
of T n (Y,i?), we can (asymptotically) control the probability of overestimating the number of 
change-points by a. In fact, we show that the null distribution of T n (Y, $) can be bounded 



asymptotically by a distribution which does not depend on i? anymore (see Section [2^2 ). 
It is noteworthy that for Gaussian observations this bound is even non-asymptotic (see 
Section 2.4). The third panel in Figure [l] shows for different choices of a (y-axis) the 



corresponding estimates for the change-point locations (black dots; the vertical ticks mark 
the true change-point locations). The number of estimated change-points is monotonically 
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Figure 1. From top to bottom: 1. True regression function 2. Gaussian 
observations Y with n = 367 and variance a 2 = 1. 3. Estimated change- 
point locations with confidence intervals for different values of 1 — a. 4. 
SMUCE $(q a ) with confidence bands (gray hatched area) and confidence 
intervals for the change-point locations (inward pointed arrows) at a = 0.4. 



decreasing in a in accordance with ([7]) which guarantees at level a that SMUCE has not 
more jumps than the true signal We emphasize that the SMUCE is remarkably stable 
w.r.t. the choice of a, i.e. the number of change-points K = 8 is estimated correctly for 
0.2 < a < 0.9. Our simulations in Section [5] confirm this stability even in non-Gaussian 
scenarios. 

As mentioned before, SMUCE controls the error of undersmoothing ([7]), i.e. the probability 
of overestimating the number of change-points. In addition, we prove an exponential 
inequality that bounds the error of oversmoothing, i.e. the probability of underestimating 
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the number of change-points 

P (k(q) <K)< 2Ke- CnXA2 



j(< ? + v /21og(2 e /A)) 2 e -Cn\A 2 



This bound only depends on the smallest interval length A, the smallest absolute jump size 
A and the number of change-points K of the true regression function Here, C > is 



some universal constant only depending on the family of distributions (see Section 2.3). 



As a consequence of the inequalities ([7j and ([8]), C(qi- a ) in ^ constitutes an asymptotic 



confidence set at level 1 — a and we will explain in Section |3.2| how confidence bands for 
the graph of •& and confidence intervals for its change-points can be obtained from this. 
See the lowest panel of Figure [T] for illustration. 

Of course, asymptotically honest (i.e. uniform) confidence sets cannot be obtained on the 
entire set of step functions S, as A and A can become arbitrarily small. Nevertheless, we can 
show that simultaneously both, confidence bands for and intervals for the change-points 
are asymptotically honest w.r.t. to a sequence of nested models S^ n ' C S that satisfy 

A^A n — 7- oo, as n — > oo, (9) 



logn 



i.e. the confidence level a is kept uniformly over as n — > oo (c.f. Section |2.6[ ). Here 
A n and A n denote the smallest interval length and smallest absolute jump size in S^ n \ 
respectively. 

1.3. Choice of q. Balancing the probabilities for over- and underestimation, gives an 
upper bound on P(K(q) ^ K) the probability that the number of change-points is mis- 
specified. This bound depends on n, q, A and A in an explicit way and opens the door for 
several strategies to select q such that P(K(q) = K) is maximized. One may incorporate 
prior information on A and A and we suggest a simple way how to do this in Section |4} 
Under a suitable choice of q = q n the probability of misspecification tends to zero and hence 
K{q n ) converges to the true number of change-points K (model selection consistency), such 
that the underestimation error in ^ vanishes exponentially fast. 

Finally, we obtain explicit bounds on the precision of estimating the change-point locations 
which again depend on q, n, A and A. For any fixed q > they are recovered for all esti- 
mators in C(q), including SMUCE, at the optimal rate 1/n (up to a log-factor). Moreover, 
these bounds can be used to derive slower rates uniformly over nested models as in ^ (see 



Section 2.6) 



1.4. Detection power for vanishing signals. For the case of Gaussian observations 
we derive the detection power of the multiscale statistic T n in (|3]), i.e. we determine the 
maximal rate at which a signal may vanish with increasing n but still can be detected with 
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probability 1, asymptotically. For the task of detecting a single constant signal against 
a noisy background, we obtain the optimal rates and constants (cf. [THl E21 EH EH])- 
We extend this result to the case of an arbitrary number of change-points, retrieving 



the same optimal rates but different constants (Section 2.5). Similar results have been 
derived recently in [49J for sparse signals, where the estimator takes into account the 
explicit knowledge of sparsity. We stress that the SMUCE does not rely on any sparsity 
assumptions still it adapts automatically to sparse signals due to its multiscale nature. 

1.5. Implementation, simulations and applications. The applicability of dynamic 
programming to the change-point problem has been subject of research recently (cf. e.g. 
[3H ESI HE])- The SMUCE $(g) can also be computed by a dynamic program due to the 
restriction of the local likelihoods to the constant parts of candidate functions. This has 
already been observed by [S] for the multiscale constraint considered there. We prove 
that ^ can be rewritten into a minimization problem of a penalized cost function with a 



particular data driven penalty (see Lemma 3.1). 



Much in the spirit of the dynamic program suggested in [M] , our implementation exploits 
the structure of the constraint set in ^ to include pruning steps. These reduce the worst 
case computation time 0(n 2 ) considerably in practice and makes it applicable to large data 
sets. Simultaneously, the algorithm returns a confidence band for the graph of d as well 
as confidence intervals for the location of the change-points (Section [3]), the latter without 
any additional cost. An R-package (stepR) including an implementation of SMUCE is 
available online B 

Extensive simulations reveal that the SMUCE is at least competitive with (and indeed often 
outperforms) state-of-the-art methods for the change-point problem which all have been 
tailor-made to specific exponential families (Section [5]). Our simulation study includes the 
CBS method [HE], the fused lasso [H2] and the modified BIC [91 J for Gaussian regression, 
the multiscale estimator in [2T] for piecewise constant volatility and the extended taut 
string method for quantile regression in [30] • In our simulations we consider several risk 
measures, such as the MSE and the model selection error P(K ^ K). Moreover, we study 
the feasibility of our approach for different real-world data sets, including two benchmark 
examples from genetic engineering and a new example from photoemission spectroscopy 
[8j H6] which amounts to Poisson change-point regression. 

1.6. Literature survey and connections to existing work. The problem of detecting 
changes in the characteristics of a sequence of observations has a long history in statistics 
and related fields, dating back to the 1950's (see e.g. [OH]). In recent years, it experienced 
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a renaissance in the context of regression analysis due to novel applications that mainly 
came along with the rapid development in genetic engineering (cf. JTU1 SSI EJJ ESI EE]) 
and financial econometrics (cf. [ZD, SO EDI 1ST]). Due to the widespread occurrence of 
change-point problems in different communities, such as statistics, electrical engineering, 
machine learning, physics, econometrics and biology, an exhaustive list of existing methods 
is beyond reach. For a selective survey, we refer the reader to the books [21 [JTJ [151 HH] an d 
the extensive list in |52j . 

Our approach as outlined above can be considered as a hybrid method of two well- 
established approaches to the change-point problem: 

Likelihood ratio statistics, on the one hand, are frequently employed to test for a change 
in the parameter of the distribution family and to construct confidence regions for change- 
point locations. Approaches of this type date back as far as [201 EJJ and have gained 
considerable attention afterwards (see [291 H21 S21 EH1 ES EJJ and [lj El [80] for general- 
izations to the multivariate case). The likelihood ratio test was also extensively studied 
for sequential change-point analysis (see [TTJ [791 [88J ) . All these methods are primarily 
designed to detect a predefined maximal number (mostly one) of change-points. 
On the other hand, if the number of change-points is unknown, an additional model selec- 
tion step is required, which can be achieved by proper penalization of model complexity, 
e.g. measured by the number of change-points itself or by surrogates for it. This is often 
approached by maximizing a penalized likelihood function of the form 

■& H> Z(y,i?) -pen(tf) 

over a suitable space of functions, e.g. S as in this paper or functions of bounded variations 
[53] . etc. Here l(Y, i?) is the (log) likelihood function. The penalty term pen($) penalizes 
the complexity of $ and prevents overfitting. It increases with the dimension of the model 
and serves as a model selection criterion. First approaches include BIC-type penalties [89] 
and more sophisticated penalties have been advocated later on (see e.g. [21 [7J El [TQl I58T 
EQlEO]). Further prominent penalization approaches include the fused lasso procedure (see 
[57] |8"2"] and [10]) that uses a linear combination of the total- variation and the £ 1 -norm 
penalty as a convex surrogate for the number of change-points which has been primarily 
designed for the situation when $ is sparse. Finally, aggregation methods [TTJ have been 
advocated recently for the change-point regression problem as well. 

Most similar in spirit to our approach are estimators which minimize target functionals 
under a statistical multiscale constraint. For some early references see [261 EE] and more 
recently [El [221 1211 [35] . In our case this target functional equals the number of change- 
points. 
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The multiscale calibration in (|3| is based on the work of [TBI E21 E3] • Multiscale penalization 
methods have been suggested in EE] , multiscale partitioning methods including binary 
segmentation in |68l 1761 184"] . and recursive partitioning in [56] . 

Aside to the connection to frequentist's work cited above, we claim that our analysis also 
provides an interface for incorporating a priori information on these quantities into the 
estimator (see Section [1]). We stress that for minimizing the bounds in ([7]) and ^ on the 
model selection error P(K(q) ^ K) it is not necessary to include full priors on the space 
of step functions S. Instead it suffices to simply specify a prior on the smallest interval 
length A and the smallest absolute jump size A. The parameter choice strategy discussed 



in Section [4] or the limiting distribution of T n (Y, $) in Section 2.2, for instance, can be 
refined within such a Bayesian framework. This, however, will not be discussed in this 
paper in detail and is postponed to future work. For recent work on a Bayesian approach 
to the change-point problem we refer to [28| [3U [63], [70] and the references therein. 
We finally stress that there is a conceptual analogy of SMUCE to the Dantzig selector as 
introduced in [13] for Gaussian observations for estimating sparse signals in high dimen- 
sional linear regression models (see [48] for an extension to exponential families). Here 
the £]-norm of the signal is to be minimized subject to the constraint that the residuals 
are pointwise within the noise level. The SMUCE, in contrast, minimizes the £ - norm of 
the discrete derivative of the signal subject to the constraint that the residuals are tested 
to contain no signal on all scales. We will briefly address this and other relations to re- 
cent concepts in high dimensional statistics in a discussion in Section [6j In summary, the 
change-point problem is an "n = p" problem and hence substantially different from high 
dimensional regression where u p 3> n". As we will show, multiscale detection of sparse 
signals becomes then possible without any sparsity assumption entering the estimator. 
Another major statistical consequence of this paper is that post model selection inference 
becomes doable over a large range of scales uniformly over nested models in the sense of 

2. Theory 



This section summarizes our main theoretical findings. In Section 2J3 we discuss consis- 
tency of the estimated number of change-points. This result follows from an exponential 
bound for the probability of underestimating the number of change-points on the one hand. 
On the other hand we show how to control the probability of overestimating the number of 



change-points by means of the limiting distribution of T n (Y, $) as n — > oo (cf. Section 2.2). 
We give improved results, including a non-asymptotic bound for the probability of overes- 
timating the number of change-points, for Gaussian observations (Sections |2.4 & 2.5). In 
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Section 2.6 we finally show that the change-point locations can be recovered as fast as the 
sampling rate up to a log-factor and discuss how asymptotically honest confidence sets for 
•d can be constructed over a suitable sequence of nested models. 



2.1. Notation and model. We shall henceforth assume that F = {Fg} eeQ is a one- 
dimensional, standard exponential family with //-densities 

f g (x) = exp(6x-ip(0)) 1 xeR. (10) 

Here = {9 G K : / R exp(#x) du(x) < oo} C K denotes the natural parameter space. 
We will assume that F is regular and minimal which means that is an open interval 
and that the cumulant transform ip is strictly convex on 0. We will frequently make use 
of the functions 

m(9) :— ip(9) — E (X) and v{9) := ^(0) = Var (X) , (11) 
for X ~ Fg. Note that m and v are strictly increasing and positive on 0, respectively. 



2.1.1. Observation model and step functions. We assume that Y = (Yi, . . . ,Y n ) are inde- 
pendent observations given by Q where & : [0, 1) ->■ is a right continuous step function, 
that is 

K 

m-J2 e ^ k+1 )(t), (12) 

fc=0 

where = r < T\ < . . . < Tk < t~k+i = 1 are the change-point locations and ^ G 
the corresponding intensities, such that 9k ^ 9k+i for k — 0, . . . , K. The collection of step 
functions on [0, 1) with values in and an arbitrary but finite number of change-points will 
be denoted by S. For °t) 6 S as in (12) we denote by J(i?) = (n, . . . ,rx) the increasingly 
ordered vector of change-points and by #J($) = if G N its length. We will denote the set 
of step functions with K change-points and change-point locations restricted to the sample 
grid by S n [K] C S. 

For any estimator $ of $ £ <S, the estimated number of change-points will be denoted by 
#</(-#) = K, the change-point locations by J(i?) = (fi, . . . ,f^) and we set 9k = "&(t) for 
t G [fjfc, Tjfc+j). For simplicity, for each n G N we restrict to estimators which have change- 
points only at sampling points, i.e. S G ^[ii'] with ffe = 4/n for some 1 < 4 < n. To keep 
the presentation simple, throughout the following we restrict ourselves to an equidistant 
sampling scheme as in Q. However, we mention that extensions to more general designs 
are possible. 
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2.1.2. Multiscale statistic. Let 1 < i < j < n. Then, the likelihood ratio statistic T-(Y,\ 
in ^ can be rewritten into 



7i(Y, B ) = snp ~ ~ 5>° F < - ^o))- 

eee J ^ 

Introducing the notation 4>(x) = sup^gg Ox — ip(6) for the Legendre-Fenchel conjugate of ip 
and J(x,8) = <p(x) — {9x — ip{0)) we find 

Tf{Y,6 ) = {j-i + l)J(Y j i ,e )>0, 
where Y\ = (X/i<Ki^)/C7 — * + 1). The multiscale statistic T n (Y, i?) in (|3| was defined 
to be the (scale calibrated) maximum over all \J 2T/ such that < i < j < 4+1 for some 
< < i^. As mentioned in the introduction we sometimes will restrict the minimal 
interval length (scale) by a sequence of lower bounds (c n ) n£ N tending to zero. In order to 
ensure that the asymptotic null distribution is non degenerate, we assume for non-Gaussian 
families (see also [H]) 

n~ l log 3 n/c n -»• 0. (13) 
Then, the modified version of ^ reads as 



T n (Y^-c n ) = max max ( JlT}{Y, k ) - ,/21og , ^ - 1. (14) 

(j-i+l)/n>c n 

2.2. Asymptotic null distribution. We give a representation of the limiting distribution 



of the multiscale statistic T n in (14) in terms of 



f\B(t)-B(s)\ r ~\ , . 

M:= sup ' V V V )l -4/21og- , (15) 

0<s<i<l \ yjt - S V t-Sj 

where (B(t))t>o denotes a standard Brownian motion. We stress that the statistic M is 
finite almost surely and has a continuous distribution supported on [0, oo) (cf. [31, 32J). 



Theorem 2.1. Assume that (c n ) n£ N satisfies (13). Then, 



T n (y,tf;c n )4 max sup ( ' K "' - j21og— ) . (16) 

< k < K T k <S<t<T k + 1 V V^-S V t-S' 



B(t)-B(s)\ [—^ 



Further, let Mo, . . . , Mr be independent copies of M. Then, the right hand side in (2.1 ) is 
stochastically bounded from above by M and from below by 



max Mfc — \ 2 log 

0<k<K \ U Tk+1 - r k 
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It is important to note that the limit distribution in ( 16 ) (same as the lower bound) depends 



on the unknown regression function $ only through the number of change-points K and 
the change-point locations rfc, i.e. the function values of d do not play a role. From the 



upper bound in Theorem 2A_ we obtain 

lim P(T n (r,tf;c n ) <q a ) > a, 

n— >oa 

with q a being the a-quantile of M. In practice the distribution of M is obtained by 



simulations. In Section 2.4 we will see that for the Gaussian case even a nonasymptotic 



version of Theorem 2.1 can be obtained, which allows for finite sample refinement of the 
null distribution of T n . We note that these are not supported in [0, oo) (see Figure [2j. 
To the best of our knowledge, it is an open and challenging problem to derive bounds for 
the tails of M (cf. [31-33J) which is not addressed in this article. By such bounds the 
probability of overestimating the number of change-points could be controlled explicitly, 
as we will see in the upcoming section. 





Figure 2. Simulations of the cdf (left) and density (right) of M in ( [15] ) 
with n = 50,500 and 5000 equidistant discretization points. 



2.3. Exponential inequality for the estimated number of change-points. In this 
section we derive explicit bounds on the probability that K{q) as defined in (pj underesti- 



mates the true number of change-points K. In combination with the results in Section 2.2 



these bounds will imply model selection consistency, i.e. P(K(q n ) = K) — y 1 for a suitable 
sequence of thresholds (q n )n&i in 

We first note, that with the additional constraint in (14) on the minimal interval length, 
the estimated number of change-points is given by 

K(q) = min {K G N : 3$ G S n [K] : T n (Y, 0; c n ) < q} , q G R. (17) 

Now let A and A be the smallest absolute jump size and the smallest interval length of the 
true regression function G S, respectively and assume that i?(t) G [9, 9} for all t G [0, 1]. 
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We give the aforementioned exponential upper bound on the probability that the num- 
ber of change-points is underestimated. The results follows from the general exponential 
inequality in the supplement, Theorem |7. 10 



Theorem 2.2 (Underestimation bound). Let gel and K(q) be defined as in (17) with 
A > 2c n . Then, there exists a constant C = C(J-, 8, 9) > s.t. 

) |( g+v /21og(2e/A)) 



P \K(q) < Kj < 2Ke- CnXA2 
From Theorem 17.101 and Lemma 17.111 it follows that 

c{F,e, 



+ e 



-Cn\A z 



n e N. 



iwf g<g<s v{ty 



(19) 

oSUPg< e <gf(t) 

which gives C = 1/8 for the Gaussian family and C = fi 2 /(8]l) for the Poisson family, 
given ra($) e [/i, Ji] in the latter case. 

On the one hand, if q = q n and q n j y/n — > as n — > oo, it becomes clear from Theorem 
that K(q n ) > K with high probability. On the other hand, it follows from Theorem 
that T n {Y, t9; c n ) is bounded almost surely as n — > oo if c n is as in (13). This in turn 

(20) 



2.2 



2.1 



implies that the probability for K(q n ) < K tends to 1, since 

P (k{q n ) >K)< P(T n (Y } $; c n ) > q n ) -> 
whenever q n — > oo, as n — > oo. Thus we summarize 



Theorem 2.3 (Model selection consistency). Let the assumptions of Theorems 2.1 and 



2.2 hold and additionally assume that q n — » oo and q n j 'y/n — > as n — )■ oo. Then, 

lim P(K(q n ) — K) — 1. 

n—>-oo 

Giving a non-asymptotic bound for the probability for overestimating the true number 
of change-points (in the spirit of (20)) appears to be rather difficult in general. For the 
Gaussian case though this is possible, as we will show in the next section. 

2.4. Gaussian observations. We now derive sharper results for the case when J 7 is the 
Gaussian family of distributions with constant variance. In this case ([T]) reads as 

fj>(i/n) + aEi, i = l,...,n (21) 



Yi 



where £±, . . . , e n are independent Af(Q, 1) random variables, a > and // G S denotes the 
expectation of Y. To ease notation we assume in the following that a = 1. For the general 
case replace A by A/ a. 

In the Gaussian case it is possible to get rid of the lower bound for the smallest scales 
c n as in (13) because the strong approximation by Gaussian observations in the proof of 
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Theorem 2.1 becomes superfluous. We obtain the following non- asymptotic result on the 



null distribution. 

Theorem 2.4 (Null Distribution of T n ). For any n G N 



max Mfc 

0<k<K 



2 log 



1 



< T n (Y,§) < M {n) < M, 



where is as M in (15) where the supremum is only taken over the system of discrete 
intervals [i/n,j/n]. 



In contrast to Theorem 2.1, this result is nonasymptotic and the inequality holds for any 
sample size. For this reason, we get the following improved upper bound for the probability 
of overestimating the number of change-points. 



Corollary 2.5 (Overestimation bound). Let q G R and K(q) be defined as in (17). Then 
for any nGN 

P (k(q) > K) <P(M>q). 



This corresponds to the "worst case scenario" for overestimation when the true signal 
has no jump. 

For the probability of underestimating the number of change-points, we can improve The- 
orem 



2.2 for Gaussian observations (see Theorem 7.12) to 



P (k(q) <K) <2K 



1 / AVXn 



exp 



8 I 2^2 




exp 



16 



(22) 



2.5. Multiscale detection of vanishing signals for Gaussian observations. We will 

now discuss the ability of SMUCE to detect vanishing changes in a signal. We begin with 
the problem of detecting a signal on a single interval against an unknown background. 

Theorem 2.6. Let $ n (t) = 6 + 8 n l n (t) for some 6 , 8 + 5 n G and for some I n C [0, 1] 



and Y be given by (21). Further let (q n ) ne N be bounded away from zero and assume 



(1) for signals on a large scale (i.e. liminf |/ n | > 0), that ^/\I n \ n5 n /q n — » oo, 

(2) for signals on a small scale (i.e. \I n \ — » 0), that ^/\I n \ n5 n > (\/2-|-£ n )ylog(l/ |/ n |) 
with e n , s.t. e n ^log(l/ |/ n |) -)> oo and sup ngN q n /(e n \/\og(l/ \I n \)) < I. 

Then, 

sup P K {T n {Y,# ) <q n ) ->0. (23) 
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Theorem 2.6 gives sufficient conditions on the signals i9 n (through the interval length \I n \ 
and the jump height 5 n ) as well as on the thresholds q n such that the multiscale 
statistic T n detects the signals with probability 1, asymptotically; put differently, this 
means P(K(q n ) > 0) — > 1. We stress that the above result is optimal in the following sense: 
No test can detect signals satisfying \/\I n \ nd > (a/2 — e n )^/\og(l/I n ) with asymptotic 



power 1 (see [161 

For the special case, when q n = q a is a fixed a-quantile of the null distribution T n (Y,"d n ) 



or of the limiting distribution M in (15)), the result boils down to the findings in 



In particular, aside to the optimal asymptotic power (23), the error of first kind is bounded 
by a. The result in Theorem [23] goes beyond that and allows to shrink the error of first 
kind to zero asymptotically, by choosing q n — > oo. 



We finally generalize the results in Theorem 2.6 to the case when e S has more than one 



change-point. To be more precise, we formulate conditions on the smallest interval and 
the smallest jump in $ such that no change-point is missed asymptotically. 

Theorem 2.7. Let ($ n ) ne N be a sequence in S with K n change-points and denote by A n 
and X n the smallest absolute jump size and smallest interval in d n , respectively. Further, 
assume that q n is bounded away from zero and 

(1) for signals on large scales (i.e. liminf A n > 0), that \/X n nA n /q n — > oo. 

(2) for signals on small scales (i.e. X n — > 0) with K n bounded, that \/X n nA n > (4 + 
e n ) v / log(l/A„) with e n ^\og(l/X n ) -> oo and g n /(e nV /log(l/A n )) < 1/2. 

(3) for signals on small scales (i.e. X n — > 0) with K n unbounded, that \/X n nA n > 
(12+e„)y / log(l/A n ) withe ny /\og(l/\ n ) -)> oo and sup neN q n / (e^\og(l/ X n )) < 1/2. 

Then, 

Ptf„ {k{q n )>K^j ^1. 
Theorem 2.7| essentially amounts to say that the statistic T n is capable of detecting multiple 



change-points simultaneously at the same optimal rate (in terms of the smallest interval 
and jump) as a single change-point. The only difference being the constants that bound 
the size of the signals that can be detected. These increase with the complexity of the 
problem: v2 for a single change against an unknown background, 4 for a bounded (but 
unknown), and 12 for an unbounded number of change-points. In [49] it was shown that 
for step functions that exhibit certain sparsity patterns the optimal constant a/2 can be 
achieved. It is important to note that we do not make any sparsity assumption on the true 
signal. Finally we mention an analogy to Theorem 4.1. of [33] in the context of detecting 



local increases and decreases of a density. As in Theorem 2.7 only the constants and not 



the detection rates changes with the complexity of the alternatives. 
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2.6. Estimation of change-point locations and simultaneous confidence sets. In 

this section we will provide several results on confidence sets associated with SMUCE. We 
will see that these are linked in a natural way to estimation of change-point locations. We 
generalize the set C(q) in ^ by replacing T n (Y,§) in ^ with T n (Y,i?, c^) as in (14) and 
consider the set of solutions of the optimization problem 



inf #J(0) s.t. 



T n (Y,^c n ) <q. 



(24) 



Any candidate in C(q) recovers the change-point locations of the true regression function 
$ with the same convergence rate. It is determined by the smallest scale (c n ) n6 N for 



the considered interval lengths in the multiscale statistic T n in (14) and hence equals the 
sampling rate up to a log factor. 



Theorem 2.8. Let q G R andC(q) be the set of solutions of (24) and assume that (c n 



IneN 



is a sequence in (0, 1]. Further choose C = 0(^,6,6) > as in (19). Then, for all n G N 



P I sup max mm |r 

^ec( 3 ) rem 



T > C r , 



< 2Ke~ 2CnCnA2 



,l(<?+V /21 °s( e / c ")) 2 _j_ e -2Cnc n A 



For a fixed signal $ G S, a sufficient condition for the r.h.s. in Theorem (2.8) to vanish as 
n — > oo is 



" A 2 C 



1 logra 



Here the constant C matters, e.g. in the Gaussian case (7 = 1/8 (cf. Section 2.3). This 
improves several results obtained for other methods, e.g. in [40] for a total variation 
penalized estimator a log 2 n/n rate has been shown. 



In the following we will apply Theorem (2.8) to determine subclasses of S in which the 



change-point locations are reconstructed uniformly with rate c n . These subclasses are 
delimited by conditions on the smallest absolute jump height A n and on the number of 
change-points K n (or the smallest interval lengths A n by using the relation K n < 1/A n ) of 
its members. For instance, the rate function c n = with some G [0,1) implies the 
condition 



nP exp( 



-n 



-> 0. 



The choice (3 = gives the largest subclass but no convergence rate is guaranteed since 
c n = 1 for all n. A value of (3 close to 1 implies a much smaller subclass of functions which 
then can be reconstructed with convergence rate arbitrarily close to the sampling rate 1 / 
n. We finally point out that the result in Theorem |2.8 does not presume the number of 



change-points to be estimated correctly. If c n additionally satisfies ( 13 ) and if in Theorem 
(2.8) q = q n — > oo slower than — logc n , we find from Theorem 2.3 that P(K(q) = K) — > 1 
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and it follows from Theorem (2.8) that for n large enough 




The solution set of the optimization problem (24) constitutes a confidence set for the true 
regression function d. Indeed, we find that 

(25) 



P (tf G C(q)) = P (T n (Y, #; c n ) < q, K < K(q) 

> P (T n (Y,&; c n ) < q) - P (k(q) < K ) . 



In particular, it follows from Theorem |2.3 that if q\- a is the 1 — a-quantile of M, the set 
C(qi- a ) is an asymptotic confidence set at level 1 — a. 



Corollary 2.9. Let a G (0, 1) and to be the 1 — a-quantile of the statistic M in (15) 
Then, 



P (0 G C(gi_«)) > 1 - a - 2ife 



-CnAA 2 



e |(g+ v /21og(2e/A)) e _CnAA 2 



+o(r 



(26) 



with C = 0(^,6, 9) > as in Theorem 2.3 Consequently one finds 

lim P (tf G C(gi_«)) > 1 - a. 

n— >oo 

for any i? G <S. 



We mention that for the Gaussian family (see Section 2.4) the inequality (26) even holds 



for any n, i.e. the o(l) term on the r.h.s. can be omitted. Thus the r.h.s. of (26) gives an 
explicit and nonasymptotic lower bound for the true confidence level of C(q a ). 
In the following we use this result to determine classes of step functions on which confidence 
statements hold uniformly. Being a subset of S, the confidence set C(q) is hard to visualize 



in practice. Therefore, in Section 3.2 we compute a confidence band B(q) C [0, 1] x 6 that 



contains the graphs of all functions in C(q) as well as disjoint confidence intervals for the 
change-point locations denoted by [r l k (q), T r k (q)] C [0, 1] for k — 1, . . . , K(q). For the sake 
of simplicity, we denote the collection {K(q), B(q), { [r l k (q), T r k (q)}} k=l _ k{q) } by I(q) and 
agree upon the notation 

■& -< I(q) if K(q) = K, (t, m) e B(q) and r k G [r l k (q), r r k (q)} for k = 1, . . . , K, (27) 
i9 -/< I(q) otherwise. 

Put differently, $ -< I(q) implies that simultaneously the number of change-points is es- 
timated correctly, the change-points lie within the confidence intervals and the graph is 



contained in the confidence band. As we will show in Section 3.2, the confidence set C(q) 
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and 7(g) are linked by the following relation: 

G C(q) ■& -< 7(g). (28) 

Following the terminology in [U2], 7(g) is called asymptotically honest for the class S at 
level 1 — a if 

liminf inf P ($ -< 7(g)) > 1 - a. 

Such a condition obviously cannot be fulfilled over the entire class S, since signals cannot 
be detected if they vanish too fast as n — > oo. For Gaussian observations this was made 



precise in Section 2.4 



To overcome this difficulty we will relax the notion of asymptotic honesty. Let C S, 
n G N be a sequence of subclasses of S. Then 7(g) is sequentially honest w.r.t. at 
level 1 — a if 

liminf inf P (■& -< 7(g)) > 1 - a. 

n— s>oo #g5(n) 



By combining (|25j), (28) and Corollary 2.2 we obtain the following result about the asymp- 
totic honesty of 7(gi_ a ). 

Corollary 2.10. Let a G (0, 1) and gi_ Q be the 1 — a-quantile of the statistic M in 
(15) and assume that (& n ) n£ N oo is a sequence of positive numbers. Define S^ n > = 

G S : nAA 2 /log(l/A) > b n , 9_<§< 6*}. Then 7(gi_ a ) is sequentially honest w.r.t. 
at level 1 — a, i.e. 

lim inf P (i? -< 7(g a )) > 1 - a. 

By estimating 1/A < n we find that the confidence level a is kept uniformly over nested 
models C S, as long as A 2 A n — >■ oo. Here A n and A n is the smallest interval length 
and smallest absolute jump size in S^, respectively. 



3. Implementation 

We now explain how the SMUCE, i.e. the estimator $(q) with maximal likelihood in 
the confidence set C(q), can be computed efficiently within the dynamic programming 
framework. In general the proposed algorithm is of complexity 0(n 2 ). We will show, 
however, that in many situations the computation can be performed much faster. 
Our algorithm uses dynamic programming ideas from [38] in the context of complexity 
penalized M-estimation. See also [21] I4"4"] for a special case in our context. Moreover, 
we include pruning steps as [54] . who also provide a survey on dynamic programming in 
change-point regression from a general point of view. We will show that it is always possible 
to rewrite $(q) as a solution of a minimization of a complexity penalized cost function with 
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data dependent penalty. To this end, we will denote the log-likelihood of ■d as 

n 



i=l 



Without restriction, we will assume that OY^) > for all d G S. 

Following [3H], we call a collection V of discrete intervals a partition if its union equals 
the set {1, . . . , n}. We denote by ^P(ra) the collection of all partitions of {1, . . . , n}. For 
V G ^(n) let #P the number of discrete intervals in V. Hence, any discrete step function 
•d G <S n [if] can be identified with a pair (V, 9), where 

V G *p(n), #V = K and 9 = (0j) /e7 > G Q* v , 

and $(t) = 9j ^ [nt\ G /. Next, we note that for a given 9j G the negative log-likelihood 
on a discrete interval I is given by \I\(ip(9i) — OfY {). With this we define the costs of Qi 
on / as 



im^-OYt) if max^jy/TT^Y^-^log^Kq 

oo else. 

The minimal costs on the interval / are then defined by d*j = min^ge di(Y, 9i) where we 
agree upon 9} G being such that dj(Y, 9}) = d}. We stress that d} = oo if and only if no 
9 1 G exists such that the multiscale constraint is satisfied on /. Finally, for an estimator 
(V, 9) the overall costs are given by 

D(V,9) = J2di(Y,9 I ). 

lev 

In [3H] a dynamic program is designed for computing minimizers of 

(V,9)^D(V,9)+j(#V-l), 7 >0. (30) 
It is shown that the computation time amounts to 0(n 2 ) given that the minimal costs 



d*j can be computed in 0(1). We now show that each minimizer of (30) maximizes the 
likelihood over the set C(q), if 7 > is chosen large enough. Note that this 7 can be 
computed explicitly for any given data (Yi, . . . , Y n ) according to the next result. 

Lemma 3.1. Let 7 > 1/2 (nq + n^j2 log(en) j + l(Y,rrT l (Y)). Then, any solution of 
(30) is also a solution of (j6j). 



For completeness, we briefly outline the dynamic programming approach for the minimiza- 
tion of (30) as established in [3H]: Define for r < n the Bellman function by B(0) = —7 
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and 



B(r)= inf D{V, 9) + j(\V\ - 1) 



and let V r G <P(r) and 9 r G G 1 ^ 1 be such that D(V r ,9 r ) + ^(\V r \ - 1) = B(r). Clearly, 



B(n) is the minimal value of (30) and (V n ,9 n ) is a minimizer of (30). A key ingredient is 



the following recursion formula (cf. [3H1 Lem. 1]) 

B(p)= inf B( r -l)+ 7 + d* 

l<r<p L ' 

Let p < n and assume that (P r , # r ) are given for all r < p < n. Then, compute the best 
previous change-point position, i.e. 

r p = argmin_B(r - 1) + d* r ^ (31) 

l<r<p 

and set V p = V r -\ U {[r p ,p]} and 9 P = (6^-1, 9\ rptP ]). With this we can iteratively compute 
the Bellman function B(p) and the corresponding minimizers (V p , 9 P ) for p — 1, . . . , n and 



eventually obtain (V n ,9 n ), i.e. a minimizer of (30). According to Lemma 3.1, this (V n ,9 ri 
solves ^ if 7 is chosen large enough. 



We stress that the specific structure of the costs (see (29)) can be employed in the dynamic 



program by including pruning steps, similar to [M]- The idea behind this is to consider 



only such r in (31) that may lead to a minimal value, i.e. those r that are strictly larger 
than max j r : d* r p j = oo j . 

Given the minimal costs d*j are computed in 0(1) the complexity of the dynamic program 
is then reduced from n 2 to n 2 J2k=i(^k — T'k-i) 2 - We mention that this is an aposteriori 
bound indicating that SMUCE is faster for signals with many detected change-points than 
for signals with few detected change-point. In particular, if the estimated change-points 
are equidistant the bound n 2 /K is obtained. 

Finally, we note that for a practical implementation of the proposed dynamic program, 
the efficient computation of the values df rp > is essential. We address this in the upcoming 
section. The proposed algorithm is implemented for the statistical software R in the 
package stepi^j The SMUCE procedure for several exponential families is available via the 
function smuceR. 

3.1. Computation of minimal costs. Let r < i < j < p. Since {Fo} dee was assumed 
to be a regular, 1-dimensional exponential family, the natural parameter space G is a 
nonempty, open interval (^1,^2) with —00 < 9\ < 9 2 < 00. Moreover, the mapping 
9 1 — y J{y\,9) is strictly convex on O and has the unique global minimum at m -1 (>^) if 
and only if m~ 1 (Y'' i ) G int(0). In this case it follows from [53 Thm. 6.2] that for all q > 



2 R package available at http://www.stochastik.math.uni-goettingen.de/smuce 
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with — oo < 6^ < m~ 1 (F^) < b^ < oo. In other words, b^ and 6^- are the two finite 
solutions of the equation 



w-i + i)' ■ (32) 

If m~ 1 (Y J i ) int(0), then (67J Thm. 6.2] implies that either = — oo or by = oo. Let 
us assume without restriction that b^ = — oo which in turn shows that = (— oo, 82) and 



m -!(^) = -00. In this case, the infimum of 9 1— > J(y i ,6 l ) is not attained and ( p2[ ) has 
only one finite solution 6^-. The lower bound 63 ■ = —00 then is trivial. 
After computing b^ and fejj for all r < z < j < p, define B = max r < i < 5 <p b^ and 
= mm r <i<j< p bij. Hence, if dt p i < 00 we obtain 

X P if m-^Fj) > £ rp 

0^ = argmin d [r , p] (y,0) = { B rp if m" 1 ^) < B 



rp \ r I — =-rp 

m~ l (Y P r ) otherwise. 



Moreover, d* rp ^ = 00 if and only if B rp > B rp . 

To summarize, the computation of 9f rp , (and hence the computation of the minimal costs 
df rp ]) reduces to finding the non-trivial solutions of (32) for all r < i < j < p. This can 



either be done explicitly (as for the Gaussian family, for example) or approximately by 
Newton's method, say. 

3.2. Computation of confidence sets. The dynamic programing algorithm gives, in 
addition to the computation of the SMUCE, an approximation to the solution set C (q) of 



(24) as discussed in Section 2.6 The algorithm outputs disjoint intervals [r|, r£] as well as 
a confidence band B(q) C [0, 1] x 6 such that for each estimator -d G C(q): 

t h G [tItD for k = l,...,K(q) and (t, 0(t)) G B(q), for aU t G [0, 1]. 

To make this clear let 1 < k < K{q) and define 

Rk = max {r : I^Vl < k} and L fc = min [p : dL_R fc i < 00} . (33) 
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Then, for any estimator $ G S n [K{q)\ that satisfies T n (Y,i?) < q, it holds that f& G [t^,t^] 
with = n _1 Lfc and r£ = n~ 1 R k . 

Now we construct a confidence band B(q) that contains the graphs of all functions in 
C(q). To this end, let d be as above and note that for 1 < k < K(q) there is exactly 
one change-point in the interval [r l k ,Tjl} and no change-point in (t^,t^ +1 ). First, assume 
that t G (r|, 7^ +1 ). Then we get a lower and an upper bound for $(t) by B Rk+1L and 
-B_R fe +iL fc+1 -i, respectively. Now let t G [r|, r£]. Then, the fc-th change-point is either to the 
the left or to the right of t and hence any feasible estimator is constant either on [r l k ,t] or 
on [£,t£]. Thus, we obtain a lower bound by min {B_ Lk \ tn i, Br nt ^ Rk j and an upper bound 
by max{B LtiW ,B MA }. 



4. On the choice of the threshold parameter 

The choice of the parameter q in ^ is crucial for it balances data fit and parsimony of 
the estimator. First we discuss a general recipe that takes into account prior information 
on the true signal d. Based on this a specific choice is given in the second part which we 
found particularly suitable for our purposes. Further generalizations are discussed briefly. 



As shown in Corollary |2.9| for the general case, q determines asymptotically the level of 
significance for the confidence sets C(q). For the Gaussian case we have shown in Section 



274] that this result is even non-asymptotic, i.e. from Corollary |2.5| it follows that 

P(K(q) >K)< a(q), (34) 

where a(q) is defined as a(q) = P(M > q). This allows to control the probability of 
overestimating the number of change-points. If the latter is considered as a measure of 



smoothness, (34) can be interpreted as a minimal smoothness guarantee. This is similar 
in spirit to results on other multiscale regularization methods (see ES])- As argued 



in Section 2.6 in general it is not possible to bound the minimal number of change-points 
without further assumptions on the true function d (see also [25] in the context of mode 
estimation for densities). However, we can draw a sharp bound for the probability of 



underestimating the number of change-points from (22) in terms of the minimal interval 
length A and minimal feature size rj 2 = n\A 2 , which gives 

2 



(K(q) 



< K < 



A 



1 



exp 



V 



8 \2y/2 



Q 




+ exp 
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--: f3(q,r],X), 



where we have exploited the fact that K < 1/A. By combining (34) with the bound above 
one finds 

P (k{q) =K)>\- a(q) - /3(q, Tj, A). (35) 
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In order to optimize the bound on the probability of estimating the correct number of 
change-points, one has to balance the error of over- and underestimation. Therefore, we 
aim for maximizing the r.h.s. over q. Given A and rf = nXA 2 we therefore suggest to 
choose q as 

?L = max{l - a(q) - /%, 77, A)} . (36) 
' <?>o 



The explicit knowledge of the influence of A and 77 in (36 ) paves the way to various strategies 
for incorporating prior information in order to determine q. One might e.g. use a full prior 
distribution on (A, 77) and minimize the posterior model selection error 

maxE(l - a(q) - (3(q,r),X)) . 

q>0 

In the following we suggest a rather simple way to proceed, which we found empirically 
to perform quite well. We stress that there is certainly room for further improvement. 



Motivated by the results of Section 2A we suggest to define A and 77 = v nX A in dependence 
of n implicitly by the following assumptions 

(i) 7]* = 12 v /-log(A) and 

(ii) V / A^ = g{A,n), 



for some function g with values in (0, 1]. According to Theorem 2.7 , the first assumption 
reflects the worst case scenario among all signals that can be recovered with probability 1 
asymptotically. The second assumption corresponds to a prior belief in the true function 
In the following simulations we always choose g(A,n) = A which puts the decay of A 



and A on equal footing. We then come back to the approach in (36) and define 



q* n = max {1 - a{q) - (3(q, 77*, A*)} (37) 

<}>0 

where A* and 77* are defined by ^ and (JTTJ) . Consequently, the maximizing element g* 



picks that q which maximizes the probability bound in (35) of correctly estimating the 
number of change-points. In practise, g* is computed numerically, where a(q) is obtained 
by Monte-Carlo simulations. Note, that g* does not depend on the true signal but only 
on the number of observations n. 

Even though the motivation for g* is build on the assumption of Gaussian observations, 
simulations indicate that it performs also well for other distributions. That is why we 
choose g = g*, unless stated differently throughout all simulations. We stress again that 



the general concept given by (36) can be employed further to incorporate prior knowledge 



of the signal as will be shown in Section 5.6| 
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5. Simulations 

As mentioned in the introduction, the literature on the change-point problem is vast and 
we will now aim for comparing our approach within the plethora of established methods 
for exponential families. All SMUCE instances computed in this section are based on the 
optimization problem ([2]), i.e. we do not restrict the interval lengths, as required in Section 
[2] for technical reasons. 



5.1. Gaussian mean regression. Recall model (21 ) in Section 2.4 with constant variance 
a 2 and piecewise constant means u, i.e. we set 6 = \ija 2 and ip{6) = « 2 /(2cr 2 ) in (10). 
Throughout the following we assume the variance a 2 to be known, otherwise one may 
estimate it by standard methods, see e.g. [22] or (2 



Then, the MR-statistic (14) evaluated at ft e «S n [if] reads as 
T n (Y,fi) = max max 



o<k<k i k <i<j<i k+1 \ aV3 ~ i + 1 V + 1 



en 

2 log 



After selecting the model K(q) according to (17), the SMUCE becomes 

K(g) ^ _ 
fi{q) = argmin - h){Y - h) 2 s.t. T n (Y,p) < q. 

In our simulation study we consider the following change-point-methods. A large group 
follows the common paradigm of maximizing a penalized likelihood criterion of the form 

J(Y,#) -pen(tf) (38) 

over $ G S n [k] for k = l,...,n, where the function pen($) penalizes the complexity 
of the model. This includes the Bayes Information Criterion (BIC) introduced in [75] 
which suggests the choice pen($) = #J(i?) /21ogn. As it was for instance stressed in [91] . 
the formal requirements to apply the BIC are not satisfied for the change-point problem. 



Instead the authors propose the following penalty function in (38), denoted as modified 
BIC: 

pe n (tf) = _ 3 #J(^) logn + ^ log(r fc - r k ^) 

\ k=l 

They compare their mBIC method with the traditional BIC as well as with the methods 
in [68] and [36] by means of a comprehensive simulation study and demonstrated the 
superiority of their method w.r.t. the number of correctly estimated change-points. For 
this reason we only consider [91] in our simulations. In addition, we will include the 
penalized likelihood oracle (PLoracle) as a benchmark, which is defined as follows: Recall 



24 



MULTISCALE CHANGE-POINT INFERENCE 



that K denotes the true number of change-points and define ui and cu u as the minimal and 
maximal elements of the set 

respectively. In other words, for all u G [u;j,u; u ] the penalized maximum likelihood estima- 
tor obtained with penalty uk has exactly K change-points. We then define the PLoracle 



to be a maximizer of (38) with pen($) = ui*#J($) where 



uj* = median y— ^ — -J . (39) 

Of course, PLoracles are not accessible in practice (since K and $ are unknown). However, 
they represent benchmark instances within the class of estimators given by ( [38] ) . 
Moreover, we consider the fused lasso algorithm which is based on computing solutions of 

n 

minV(r i - d{i/n)f + Ai||tf|L + A 2 ||tf|| TV , (40) 

where denotes the Zi-norm and ||-|| TV the total variation semi-norm (see also |40j). 
The fused lasso is not specifically designed for the change-point problem. However, due to 
its prominent role and its application to change-point problems (see e.g. |83j), we include 
it into our simulations. An optimal choice of the parameters (Ai, A2) is crucial and in our 
simulations we consider two fused lasso oracles FL MSE and FL C_P . In 500 Monte Carlo 
simulations (using the true signal) we compute Ai and A2 such that the MISE is minimized 
for the FL M E and such that the frequency of correctly estimated number of change-points 
is maximized for FL C_P . 

In summary, we compare SMUCE with the modified BIC approach suggested in [91 j . the 
CBS algorithm^ proposed in (HE], the fused lasso algorithm^] suggested in [82J and the 



PLoracle as in (39). Since the CBS algorithm tends to overestimate the number of change- 
points the authors included a pruning step which requires the choice of an additional 
parameter. The choice of the parameter is not explicitly described in [68] and here we only 
consider the unpruned algorithm. 

We follow the simulation setup considered in [681 191"] . The application they bear in mind is 
the analysis of array-based comparative genomic hybridization (array-CGH) data. Array- 
CGH is a technique for recording the number of copies of genomic DNA (cf. [50J). As 
pointed out in [HE], piecewise constant regression is a natural model for array DNA copy 



R package available at http://craii.r-project.org/web/packages/PSCBS 
4 R package available at [http : //cran . r-pro j ect . org/web/packages/f lsa / 
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100 200 300 400 500 

Figure 3. True signal (solid line), simulated data, SMUCE (dashed line), 
confidence bands (grey hatched) and confidence intervals for the change- 
points (inwards pointing arrows) for a = (left), a = 0.01 (middle) and 
a = 0.025 (right). 



number data (see also Section 5.6.1). Here, one has n = 497 observations with con- 
stant variance a 2 = 0.04 and the true regression function has 6 change-points at loca- 
tions n = h/n and (l u . . . , l 6 ) = (138, 225, 242, 299, 308, 332) with intensities (0 O , . . . , 6 ) = 
(—0.18,0.08,1.07,-0.53,0.16,-0.69, —0.16). In order to investigate robustness against 
small deviations from the model with step functions, a local trend component is included 
in these simulations, i.e. 

Yi ~J\f(d(i/n) + 0.256 sin(avri), 0.04), i = l,...,n. (41) 

Following [91] we simulate data for a = (no trend), a = 0.01 (long trend) and a = 0.025 
(short trend) (see Figure [3]). 

Table [T] shows the frequencies of the number of detected change-points for all mentioned 
methods and a = 0,0.01,0.025 as well as b = 0.1,0.2 and the corresponding MISE. More- 
over, in Figure [4] we displayed typical observation of model (41) with a = 0.1 and b = 0.1 
and most considered estimators. The results show that the SMUCE outperforms the mBIC 
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CBS(p]) 


long 


0.000 


0.000 


0.844 


0.086 


0.070 


0.00166 


0.000 


0.000 


0.716 


0.142 


0.142 


0.00245 


FL C "P 


long 


0.071 


0.121 


0.180 


0.218 


0.410 


0.08703 


0.053 


0.087 


0.150 


0.215 


0.495 


0.08554 


pL MSE 


long 


0.000 


0.000 


0.000 


0.000 


1.000 


0.00323 


0.000 


0.000 


0.000 


0.000 


1.000 


0.00412 


SMUCE 
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0.000 


0.001 


0.992 


0.007 


0.000 


0.00119 


0.000 


0.002 


0.959 


0.039 


0.000 


0.00143 


PLoracle 


short 


0.103 


0.002 


0.895 


0.000 


0.000 


0.00212 


0.192 


0.000 


0.808 


0.000 


0.000 


0.00315 


mBIC ([91]) 


short 


0.000 


0.000 


0.940 


0.056 


0.004 


0.00124 


0.000 


0.000 


0.871 


0.117 


0.012 


0.00156 


CBS(p]) 


short 


0.000 


0.000 


0.821 


0.115 


0.064 


0.00149 


0.000 


0.000 


0.666 


0.190 


0.144 


0.00191 


FL C "P 


short 


0.109 


0.123 


0.159 


0.205 


0.404 


0.08908 


0.106 


0.120 


0.176 


0.211 


0.387 


0.08966 


pL MSE 


short 


0.000 


0.000 


0.000 


0.000 


1.000 


0.00321 


0.000 


0.000 


0.000 


0.000 


1.000 


0.00393 



Table 1. Frequencies of estimated number of change-points and MISE by 
model selection for SMUCE, PLoracle, mBIC [911. CBS [68J and fused lasso 
oracle FL MSE . The true signals, shown in Figure 3l have 6 change-points. 



[91] slightly in all scenarios and appears to be less vulnerable for trends, in particular. No- 
tably, SMUCE often performs even better than the PLoracle. In the comparison of SMUCE 
to FL M E and FL C_P one should account for the quite different nature of the fused lasso: The 
weight Ai in (40) penalizes estimators with large absolute values, while A2 penalizes the 
cumulated jump height. However, none of them encourages directly sparsity w.r.t the num- 
ber of change-points. That is why these estimators often incorporate many small jumps 
(well known as the staircase effect). In comparison to SMUCE one finds that SMUCE 
outperforms the FL MSE w.r.t the MISE and it outperforms FL C_P w.r.t. the frequency of 
correctly estimating the number of change-points. The example in Figure [4] suggests that 
the features of the true signal are recovered by FL MSE . But additionally, there are also 
some artificial features in the estimator. 

Again, we note that Table [l] can be complemented by the simulation study in [HI] which 
accounts for the classical BIC [7S] and the method suggested in 



5.2. Gaussian variance regression. Again, we consider normal data Yj, however, in 
contrast to the previous section we aim to estimate the variance a 2 G S. For simplicity 
we set /i = 0. This constitutes a natural exponential family with natural parameter 
6 = -(2a 2 )- 1 and ip(9) = -log(-20)/2 for the sufficient statistic Z { = Y 2 , i = 1, . . . , n. 
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is easily seen that the MR-statistic in this case reads as 
T n (Z,a 2 ) = max max 1 V " 

0<k<K l k <i<j<i k+1 
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Figure 4. An example of model (41) for a = 0.01 and b = 0.1. From top 



left to bottom right: SMUCE, mBIC, PLoracle, CBS, FL MSE and FL cp 
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After selecting the model K(q) according to (17), the SMUCE is given by 



K(q) 



a 2 (q) = argmax J^(4+i - 4) log 



'ii 



<J7 



s.t. T n (Z, a 2 ) < q. 



We compare our method to [2TJ |44] . Similar to SMUCE they propose to minimize the num- 
ber of change-points under a multiscale constraint. They additionally restrict their final 
estimator to coincide with the local maximum likelihood estimator on constant segments. 
As pointed out by the authors this may increase the number of detected change-points. 
Following their simulation study we consider test signals a k with k = 0,1, 4, 9, 19 equidis- 
tant change-points and constant values alternating from 1 to 2 (k — 1), from 1 to 2 (k — 4), 
from 1 to 2.5 (k — 9) and from 1 to 3.5 (k — 19). For this simulation the parameter of 
both procedures are chosen such that the number of changes should not be overestimated 
with probability 0.9. The frequencies for underestimation, correct estimation and overes- 
timation are given in Table [2] as well as the average number of estimated change-points 
and the MISE and MIAE w.r.t. the true signal. Considering the number of correctly esti- 
mated change-points, it shows that SMUCE performs slightly better for fewer changes and 
slightly worse for many changes. This may be explained by the fact that the multiscale 
test in [21] does not include a scale-calibration and is hence more sensible on small scales 
than on larger ones. With respect to MISE and MIAE the SMUCE outperforms in every 
scenario. 





SMUCE 


m 







1 


4 


9 


19 





1 


4 


9 


19 


avg 
under 
exact 

over 


0.1190 

0.0000 
0.8860 
0.1140 


1.0640 
0.0000 
0.9410 
0.0590 


4.0290 
0.0000 
0.9720 
0.0280 


9.0050 
0.0050 
0.9850 
0.0100 


18.7360 
0.2430 
0.7550 
0.0020 


0.1560 
0.0000 
0.8640 
0.1360 


1.1050 
0.0000 
0.9010 
0.0990 


4.0470 
0.0000 
0.9550 
0.0450 


9.0050 
0.0100 
0.9750 
0.0150 


18.9270 
0.0700 
0.9260 

0.0040 


MISE 
MIAE 


0.0007 
0.0198 


0.0066 
0.0445 


0.0214 
0.0792 


0.0616 
0.1289 


0.2584 
0.2726 


0.0008 
0.0193 


0.0090 
0.0472 


0.0341 
0.0964 


0.1171 
0.1842 


0.4664 
0.4027 



Table 2. Comparison of SMUCE and the method in [21]. Average number 
of estimated change-points, frequencies of underestimation, exact estimation 
and overestimation as well as MISE and MIAE for both estimators. 



5.3. Poisson regression. We consider the Poisson-family of distributions with intensity 
/i > 0. Then, 9 = log/z and ip{6) = exp#. The MR-statistic is computed as 



T n (Y, ft) = max _ max y/2(j - i + 1) \ Y\ log + fi h - Y\ - J 2 log 



0<fc<Kf fc <i<7<f fc+ i I V V j — i + 1 
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For K(q) as in (17), the SMUCE is given by 

p,(q) = argmax ^(4+i - l k )(Y£ +1 log fi k - fi k ) s.t. T n (Y, fi) < q. 

fieS n [K(q)] k=0 

In applications (c.f. the example from photoemission spectroscopy below), one is often 
faced with the problem of low count Poisson data, i.e. when the intensity /i is small. It 
will turn out that in this case, data transformation towards Gaussian variables such as 
variance stabilizing transformations are not always sufficient and it pays off to take into 
account the Poisson likelihood into SMUCE. 

In the following we perform a simulation study where we use a signal with a low count and a 
spike part (see top panel of Figure [5]). In order to evaluate the performance of the SMUCE 
we compare it to the BIC estimator and the PLoracle as described before. Moreover, we 
included a version of the SMUCE which is based on variance stabilizing transformations of 
the data. To this end, we applied the mean-matching transformation [TJ] to preprocess the 
data. We then compute the SMUCE under a Gaussian model and retransform the obtained 
estimator by the inverse mean-matching transform. The resulting estimator is referred to as 
SMUCE mm . Moreover, as a benchmark, we compute the (parametric) maximum likelihood 
estimator with K = 7 change-points, which is referred to as MLoracle. 





<5 6 
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8 >9 


MISE MIAE Kullback-Leibler 


SMUCE 
BIC 

SMUCE mm 


0.000 0.067 
0.000 0.000 
0.013 0.420 


0.929 

0.080 
0.561 


0.004 0.004 
0.094 0.920 
0.005 0.006 


0.274 0.217 0.0187 

0.575 0.313 0.0417 
0.434 0.364 0.0418 


PLoracle 
MLoracle 


0.045 0.014 
0.000 0.000 


0.942 
1.000 


0.000 0.000 
0.000 0.000 


0.275 0.217 0.0185 
0.258 0.208 0.0143 



Table 3. Frequencies of K and distance measures for SMUCE, the BIC 
[75] , the SMUCE for variance stabilized signals as well as the PLoracle and 
MLoracle. 



Table [3] summarizes the simulation results. As to be expected the standard BIC performs 
far from satisfactorily. We stress that SMUCE clearly outperforms the SMUCE mm , which 
is based on Gaussian transformations. Note, that the SMUCE mm systematically underesti- 
mates the number of change-points K = 7 which highlights the difficulty to capture those 
parts of the signal correctly, where the intensity is low. Again, SMUCE performs almost 
as good as the Poisson-oracle PLoracle. To get a visual impression along with the results 
of Table |3j we illustrated these estimators in Figure |5j 
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Figure 5. from top to bottom: simulated data, true signal, SMUCE with 
confidence bands for the signal intensities (gray area) and confidence intervals 
for the change-points (inward pointed arrows), SMUCE mm and Ploracle. 



5.4. Quantile regression. In a final simulation scenario we employ the multiscale statistic 
for quantile regression. Let the observations Y±, . . . , Y n be given by model Q, without the 
assumption that the distribution belongs to an exponential family. Instead of estimating 
the mean value function $ we now aim for estimating the corresponding /3-quantile function. 
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This problem can be traced back to Bernoulli regression as follows: Given i.i.d. observations 
Zi, . . . , Z m and their /3-quantile ( define the random variables. W±, . . . , W m by 



,n. 



ifZi<C 

, i = l 

1 otherwise 

Then, W\, . . . , W n are i.i.d. Bernoulli observations with mean value (3. Generalizing this 
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Figure 6. First row: block signal (left) and simulated data (right). Second 
row: Estimator for median (solid), 0.1 and 0.9-quantiles (dashed) according 
to SMUCE (left panel) and generalized taut string (right panel). 



idea, the Bernoulli multiscale statistic can be employed to estimate change-points in the 
quantile function. We follow the simulation set up in [30J and choose 

Yi ~ Cau(#(z/n), 0.4), i = 1, . . . , 2048, 

where Cau(Z, s) denotes the Cauchy distribution with location /, scale s, and •& is a rescaled 
version of the standard test signal blocks from [27] (see Figure [6]). 
In [30] an extension of the taut string algorithm [22] has been suggested for estimating 
The authors showed in a simulation study, that their method is particularly suitable to 
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gen 


. taut string 


SMUCE 


n\p 


0.5 


0.1 


0.9 


0.5 


0.1 


0.9 


512 
2048 
8192 


3 (6.0) 
9 (0.9) 
9 (0.0) 


1 (7.4) 
4 (4.7) 
6 (3.4) 


(8.4) 
3 (5.5) 
5 (4.1) 


3 (6.1) 
9 (0.4) 
9 (0) 


2 (7.2) 
4 (5.2) 
6 (3.2) 


2 (7.1) 

4 (4.9) 

5 (4.2) 



Table 4. Comparison of SMUCE and generalized taut string [3D]. Median 
of local extremes of the estimator and mean absolute difference (in brackets) 
to true number of local extremes. The true number of local extremes equals 
9. 



detect local extremes of a signal. We follow this idea and repeated their simulation study. 
The results are shown in Table HI It can be seen that the difference between both methods 
is rather small. This is remarkable, since SMUCE is not primarily designed to optimize 
the number of local extremes. 



5.5. On the coverage of confidence sets I(q). In Section 2.6 we gave asymptotic 



results on the simultaneous coverage of the confidence sets I(q) as defined in (27). In our 



simulations we choose q = q\- a to be the 1 — a-quantile of M as in (15). It then follows 



from Corollary |2.10| that asymptotically the simultaneous coverage is larger than 1 — a. We 
now investigate empirically the simultaneous coverage of I(qi_ a ). To this end, we consider 
the test signals shown in Figure [JJ for Gaussian observations with varying mean, Gaussian 
observations with varying variance, Poisson observations and Bernoulli observations. 
Table [5] summarizes the empirical coverage for different values for a and n obtained by 
500 simulation runs each and the relative frequencies of correctly estimated change-points, 
which are given in brackets. The results show that for n = 2000 the empirical coverage 
exceeds 1 — a in all scenarios. The same is not true for smaller n (indicated by bold letters), 
since here the number of change-points is misspecified rather frequently (see numbers in 
brackets). Given K has been estimated correctly, we find that the empirical coverage of 
bands and intervals is in fact larger than the nominal 1 — a for all simulations. 

5.6. Real data results. In this section we turn to real data examples. The examples 
show the variety of possible applications for SMUCE. Moreover, we revisit the issue of 
choosing q as proposed in Section [4] and illustrate it's applicability to the present tasks. 

5.6.1. Array CGH data. Array Comparative Genomic Hybridization (CGH) data show 
aberrations in genomic DNA. The observations consist of the log-ratios of normalized 
intensities from disease and control samples. The statistical problem at hand is to identify 
regions on which the ratio differs significantly from (which corresponds to a gain or a 
loss). These are often referred to as aberration regions. A thorough overview of the topic 
and a comparison of several methods is given in [57]. We compute the SMUCE for two 
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Figure 7. f.l.t.r.: Gaussian observations with varying mean, Gaussian ob- 
servations with varying variance, Poisson and (binned) Bernoulli observa- 
tions and SMUCE (solid) with confidence bands (grey hatched) and confi- 
dence intervals for change-points (inwards pointing arrows). 



data sets studied in |57] and more recently in [281 183]. The data sets show the Array-CGH 
profile of chromosome 7 in GBM29 and chromosome 13 in GBM31, respectively (see also 

EHJE7]). 

By means of these two data examples we illustrate how the developed theory in Section 
[2] can be used for applications. As it was stressed in [57J many algorithms in change- 
point detection do strongly depend on the proper choice of a tuning parameter, which is 
often a difficult task in practice. We point out that our proposed choice of the threshold 
parameter q has in fact a statistical meaningful interpretation as it determines the level of 
the confidence set C(q). Moreover, we will emphasize the usefulness of confidence bands 
and intervals for Array CGH data. 

We first consider the GBM29 data. In order to choose q according to the suggested pro- 
ceeding in (36), assumptions on A and A have to be imposed. We have chosen A > 0.05, 
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n 


1 - a 


Gaussian 
(mean) 


Gaussian 
(variance) 


Poisson 


Bernoulli 


1000 


0.8 
0.9 
0.95 


0.59 0.64 0.92 
0.48 0.49 0.98 
0.28 0.28 1.00 


0.66 0.68 0.97 
0.39 0.39 1.00 
0.16 0.18 0.93 


0.87 0.89 0.98 
0.85 0.86 0.99 
0.71 0.74 0.96 


0.85 0.90 0.94 
0.86 0.86 0.99 
0.66 0.70 0.94 


1500 


0.8 
0.9 
0.95 


0.84 0.90 0.93 
0.73 0.74 0.98 
0.55 0.56 0.98 


0.87 0.88 0.98 
0.72 0.74 0.97 
0.45 0.47 0.98 


0.92 0.95 0.96 
0.95 0.97 0.98 
0.92 0.93 0.99 


0.93 0.97 0.96 
0.96 0.97 0.99 
0.89 0.90 0.99 


2000 


0.8 
0.9 
0.95 


0.94 0.99 0.95 
0.98 1.00 0.98 
0.99 1.00 0.99 


0.98 1.00 0.98 
0.99 1.00 0.99 
0.97 0.99 0.98 


0.95 0.99 0.95 
0.96 0.99 0.96 
1.00 1.00 1.00 


0.96 0.99 0.97 
0.97 0.99 0.98 
0.99 1.00 0.99 



Table 5. Empirical coverage obtained from 500 simulations for the signals 
shown in Figure [7} For each choice of a and n we computed the simultaneous 
coverage of I(q), as in (27) (first value), the percentage of correctly estimated 



number of change-points (second value) and the simultaneous coverage of 
confidence bands and intervals for the change-points given K(q) = K (third 
value). 
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Figure 8. Left: Probability for over/underestimating the number of 
change-points (dashed lines) and their sum (solid line). Top right: Detected 
change-points for different values of q. Bottom right: SMUCE computed for 
optimal q*. 



which implies K < 20, and A > 4, i.e. we expect relatively large changes on small seg- 
ments. In the left panel of Figure [9] we depict the probability of overestimating the number 



MULTISCALE CHANGE-POINT INFERENCE 



35 



of change-points as a function of q (decreasing dashed line) and the probability of overes- 
timating the number of change-points as a function of q (increasing dashed line) under the 
above stated assumption on A and A. 

Moreover, for different choices of q we compute the SMUCE and the corresponding bounds 
for the error terms. The top-right panel of Figure [8] shows the estimated change-points 
with its confidence intervals. Bounds for the probability that K was overestimated can 
be found on the left axis, bounds for underestimation on the right axis. First note from 
the top-right image in Figure [8] that the SMUCE is quite robust w.r.t. q = q±- a - For 
a G [0.1,0.6] SMUCE always detects exactly 7 change-points in the signal. 
We finally choose q optimal according to (37) such that the sum of both bounds is minimal 
(shown by the dashed vertical line in the left panel). This leads to the choice q = 4.8 and 
a « 0.2. The corresponding estimate is shown in the lower-right panel of Figure [8j Recall 
that the main goal of Array CGH data analysis is to determine segments on which the 
signals differs from 0. The confidence sets in the right lower plot indicate three intervals 
with signal different from 0. Moreover, as indicated by the blue arrows, the change-point 
locations are detected very precisely. Actually, the estimator suggests one more change- 
point in the data. However, it can be seen from the confidence bands that there is only 
small evidence for the signal to be nonzero. 

Put differently, not only an estimator for the true signal is obtained, but also 3 regions of 
abberation were detected and simultaneous confidence intervals for the signal's value on 
this regions at a level of a = 0.9 are given. This is in accordance with others' findings 
[28j [57] and the existence of these three regions can now be confirmed a level of a = 0.9. 
The same procedure as above is repeated for the GBM31 data as shown in Figure [9j For 
the bounds on underestimating the number of change-points we have assumed that the 
function has at most two jumps with minimal segment length A > 0.3 and minimal change 
of A > 0.4. Hence, in contrast to the previous example we focus on finding few small 
changes on large segments. Using the same reasoning as above we identify one large region 
of abbreviation and obtain a confidence interval for the corresponding change-point as well 
as for the signal's value. Here, the optimized q in the sense of (37) gives a ~ 0.12 which 
yields a SMUCE with one jump. 



5.6.2. Photoemission Spectroscopy (PES). Electron emission from nanostructures triggered 
by ultrashort laser pulses has numerous applications in time-resolved electron imaging and 
spectroscopy [73] . In addition, it holds promise for fundamental insight into electron corre- 
lations in microscopic volumes, including antibunching [53]. Single-shot measurements of 
the number of electrons emitted per laser pulse [HI E] will allow for the disentanglement of 
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Figure 9. Left: Probability for over/underestimating the number of 
change-points (dashed lines) and their sum (solid line). Top right: Detected 
change-points for different values of q. Bottom right: SMUCE computed for 
optimal q. 



various competing processes governing the electron statistics, such as classical fluctuations, 
Pauli blocking and space charge effects. 

We investigate with the SMUCE approach PES data displayed in the bottom panel of 



Figure [10} It represents a time series of electron numbers recorded from a PES experiment 
performed in the Ropers lab (Department of Biophysics, University of Goettingen, see 
[S]). It is custom to model PES data by Poisson regression with unknown intensity. This 
intensity is known to show long term fluctuations which correspond to variation in laser 
power and laser beam pointing, which cannot be entirely controlled in the experiment. 
However, on a short time scale, a challenging task is to investigate underdispersion in the 
distribution. Such underdispersion would indicate an electron interaction in which the 
emission of one (or a few) electrons decreases the likelihood of further emission events. 
Specifically, a significant underdispersion in the single-shot electron number histogram 
would evidence an anticorrelation caused by electrons being Fermions that obey the Pauli 
exclusion principle. A piecewise constant mean that models sudden changes in the laser 
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Figure 10. Top: Detected Change-points and confidence intervals for dif- 
ferent values of a. Middle: SMUCE with confidence bands and binned PES 
data. Bottom: ML-Estimator with 10 change-points. 



intensity to reflect the large scale fluctuations is used for segmentation of the data for 
further investigation of under- or overdispersion in these segments. 

Figure 10 shows the estimated change-points of SMUCE (and the corresponding confidence 
intervals) for a = 0.05,0.1,. . . ,0.9 in the top panel. We also display the SMUCE with 
confidence bands for a = 0.9 (middle) and for comparison the MLE with A'smuce(q') = 10 
change-points (bottom). Note, that the MLE is computed without the additional constraint 
T n (Y,/t) < q, in contrast to SMUCE. Remarkably, this results in a different estimator. 
We estimate the dispersion of data Yi, . . . ,Y m by p — <7 2 /£i, where fi = I/jtoJ^Li ^ anc ^ 
a = l/fnY^iLi(Y — fa) 2 - In Table [6] /t = l/wi^^Li ^ ^ s sriown f° r the whole dataset as well 
as for the segments identified by SMUCE. It can be seen that our segmentation allows to 
explain the overall overdispersion to a large extent, by the long term fluctuations. However, 
the results in Table [6] do not indicate significant underdispersion on any of the identified 
segments. This may be explained by a masking effect due to fluctuations of the emission 
current. Future experiments using more stable emission currents are underway. 
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0.98 


1.04 


1.01 


1.04 


0.98 1.03 
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Table 6. Dispersion estimator p of the whole dataset and on the segments 
identified by SMUCE 

6. Discussion 

6.1. SMUCE from a linear models perspective. For normal mean regression one may 



rewrite the change-point regression model in (21) as a linear model 



Y = X/3 + ae, 

where = — denotes the jump heights. If we add a vector of ones and a coefficient 
Po to define the offset of the function, then X is an {n x n) upper triangular matrix with 
entries X it j — 1, i > j and zero else. Hence, in the terminology of high dimensional linear 
models, we have an "n = p" problem in contrast to the "p 3> n" situation which has 
perceived enormous attention during the last two decades. If we rescale by 1/ y/n, then we 
find that X l X/n = min(i, j)/n tends to the covariance function of a standard Brownian 
motion. From this limiting covariance it becomes immediately clear that assumptions 
like the restricted isometry property and related conditions (see [131 El ES]) fail without 
additional restrictions, e.g. sparsity assumptions on the jump locations. For a thorough 
discussion see the Appendix in [40j Roughly speaking, these assumptions guarantee that 
estimators which are based on minimizing £o(/3), i.e. the number of jumps, can be obtained 
by the £i(/3) surrogate with large probability. This may be taken as a rough explanation for 
the fact that TV and l\ penalization method do not perform competitive in a multiscale 
framework to some extent for estimating location and number of change-points. This is 
supported by our simulations, as they built in too many little jumps. 

6.2. Risk measures. SMUCE aims to maximize the probability of correctly specifying 
the number of jumps P(K = K) uniformly over sequences of models such that A n A^ tends 
to zero not as fast as logn/n. This is conceptually very different from optimizing d w.r.t. 
convex risk measures such as the mean squared error and related concepts. The latter 
measures do not primarily target on the jump locations and number of jumps. Therefore, 
we argue that in those applications, where the primary focus is on the jump locations 
SMUCE may be advantageous. We mention NMR spectroscopy (see [35]) and [15] for 
an application in membrane biophysics. In fact, maximizing the probability of correctly 
estimating the number of jumps as SMUCE advocates has some analogy to risk measures 
for variable selection problems, shown to perform adequately successful in high dimensional 
models. This includes the false discovery rate (FDR) [3] and related ideas (see e.g. [39J). 
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Whereas in our context the latter ones aim to minimize the expected relative number of 
wrongly selected change-points, SMUCE is able to give at the same hand a guarantee that 
the true change-points will be detected with large probability and hence controls the false 
acceptance rate (FAR) as well. 

6.3. Computational costs. In [31] the authors showed that their pruned exact linear 
time method leads to an algorithm which expected complexity is linear in n in some 
cases. As stressed in Section [3j our algorithm includes similar pruning steps. Due to the 
complicated structure of the cost functional, however, it seems impossible to prove such a 
result for the computation of SMUCE. The computation can, of course, be further reduced 
significantly if e.g. only intervals of dyadic lengths are incorporated into the multiscale 
statistic. Since the dynamic approach leads to a recursive computation, SMUCE can be 
updated in linear time, if applied to sequential data. Another interesting strategy to 
reduce the computational costs could be adapted from [721 ES] who suggest to restrict the 
multiscale constraint to a specific system of intervals of size 0(ri) which still guarantees 
optimal detection. 



6.4. The choice of a. We have offered a strategy to select the threshold q = q a and 
hence the confidence level a in a sensible way to minimize P(K ^ K), by balancing the 
probabilities of over- and underestimation of K, simultaneously. This is based on the 
inequalities in Section [4] depending on A, A and n. As indicated in Figures Tf8|9 and 



10 this can be used to consider the evolution of SMUCE depending on a as a universal 



"objective" smoothing parameter. The features (jumps) of each SMUCE given a then may 
be regarded as "present with certain confidence" similar in spirit to ideas underlying siZer 
(see [ITJUH]). It is striking that in many simulations we found that features (jumps) remain 
persistent for a large range of levels a. Of course, other strategies to balance P(K(q) > K) 
and P(K(q) < K) are of interest, e.g. if one of these probabilities is considered as less 
important. For a first screening of jumps, P(K(q) > K) is the less serious error and 
P(K(q) < K) should be minimized primarily. This can be achieved by optimizing the 
convex combination SP(K(q) > K) + (1 — 5)P(K(q) < K) for a weight 5 close to 1 along 
the lines described in Section HI 
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7. Supplement to "Multiscale change-point inference" 

In this supplement we collect the proofs of the main assertions in the paper together with 
some auxiliary lemmas. We further give more general versions of some results in the paper. 

7.1. Large deviation and power estimates. We begin by recalling some large deviation 
results for exponential families. By D{9\\9) we will denote the Kullback-Leibler divergence 
of Fa and Fs, i.e. 



D(e\\§) 



fo(x) log 



fe(x) 



dv{x) = ip(9) - tp(9) -(§- 6)m(6). 



(42) 



With the techniques used in [93j Thm.7.1] it is readily seen that for a sequence of inde- 
pendent and Fe-distributed r.v. Y±, . . . , Y n one has that 



P (Y - m(9) >r])<e 



n(D(8\\8+s)-rie) 



(43) 



for all e > such that 9 + e G 0. The following restatement of inequality (43) turns out 
to be very useful. 

Lemma 7.1. Let Y = (Y\, . . . ,Y n ) be independent random variables such that Yj, ~ Fg 
and assume that 5 > is such that 6 + 5 G 6. Then, 

P(m- 1 (F) > 6 + 5) < e - nD( - e+s W e \ 



Proof. First observe that according to ( 43 ) 

P(m _1 (F) > 6 + 5) = P(F - m(6) > m(6 + 5)- m(8)) 

< exp(n(D(9\\6 + 5)- (m(9 + 5)- m{6))8)). 



Now it follows from (42) that 

D(0||0 + 5) - (m(6 + 5)- m{9))5 = ip(9 + 5)- ip(9) - m{9 + 5)5 

= -(V>(0) - 1>(9 + 5) -(9 -(9 + 5))m(9 + 5)) 
= -D(9 + 5\\9). 

From (43) we further derive a basic power estimate for the likelihood ratio statistic (|4]). 



□ 



Lemma 7.2. Let Y = (Yj_, . . . ,Y n ) be independent random variables such that Y$ ~ Fg 
and assume that 5 G K is such that 9 + 5 G 0. Then, 

P (T?(Y, 9 + 5)>q)>l-exp(n inf \d(9\\9 + e) - %D(9\\9 + 5) + 

V ee[o,<5] L o nd 
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Proof. For 



we obtain 



Thus, we have 



J(y, 9) = 4>{Y) — {Y9 — ip(6)) 
j(Y, 6 + 5) = J(Y, 9)-8Y- ip(9) + ip(9 + 5). 

U(q,n,5) :=P(T?(Y,6 + 5)>q) 

n/ 

q 



(44) 



p[j(X,e + s) > 



p 



SY > - -MO + 8) +ijj(9) 
n 



> P (-SY > - -ip(9 + 5) +ij}{9) 



n 



where in the last inequality holds since J(x, 9) > for all x G K and 9 G 0. Now, let us 

(45) 



first assume that 6 > 0. Then by (42) we find 



q 



-SY > - - M9 + 5) + ip(9) 
n 



on o 



Combining this with the large deviation inequality (43) yields 



1% n, 5) > 1 - exp (n(D(9\\9 + e) - ^D{6\\6 + 6)) + 
for all < £ < 5. The case when 5 < follows analogously. 
For Gaussian observations the estimate can be made explicit. 



eq 
8 



□ 



Lemma 7.3. Let Yi, . . . ,Y n be i.i.d. random variables such that Y\ ~ A/"(0, 1) and let 
x + = max(0, x) for x G 1R. Then, 

P (T"(y, 8)>q)>l- exp (-± (y08 - V^q) 2 ) ■ (46) 



Proof. Since D(9\\9 + e) = e 2 /2 we find that 



inf n 

ee [0,5] 



if > v^g 



□ 



7.2. Proof of Theorem 2.1 , Throughout this section we will assume that Y = (Y"i, . . . , F n ) 
are independent and identically distributed random variables with Y\ ~ and 9 G 0. 
Without loss of generality we will assume that m(9) = ip(9) = and v (9) = ip(9) = 1. 



Moreover, assume that (c n )„ g N satisfies (13) and introduce I(c n ) = ■ j — i + 1 > c n n}. 
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We start with some approximation results for the extreme value statistic of the partial sums 
Yl 

Lemma 7.4. There exist i.i.d standard normally distributed r.v. Z\, . . . , Z n on the same 
probability space as Yi, . . . , Y n such that 



lim \/log n max 

n->oo (i,j)<El(c n ) 



j-i + 1 



m\ 



a.s. 



Proof. We define the partial sums S$ = and Sf = Y\ + . . . + Y\ and observe that 
(j — i + 1) | Y\ J = | Sj — Sj_ x | . Analogously we define Sf . Now let such that j — i + 1 > 
nc n and observe that 



CY QY cz cz \ 

°j \ J j J i-1\ 



< 



I cY qZ\ 

\°j D i I 



QY Q 



I cY qZ\ 
I I \ D l °M 

< z max 



0<K 



It follows from the KMT inequality [95J Thm. 1] and (13) that 







ogn max 



I cY qZ\ 

\°i °i \ 



0<l<n y/nc r , 



o(l) a.s. 



□ 



Lemma 7.5. 



max 

(ij)6^(Cn) 



27? (y,. 



Proof. Set £ = m -1 and note that £ is strictly increasing. Since G is open, there exists 
for each given 5' > a 5 > such that £(-85(0)) C Bs>(9) C 0. Next define the random 
variable 

ri 



max 

l<i<j<n 



Y 



Then it follows from Shao's Theorem [96J that L n J \A°g n converges a.s. to some finite 
constant and we hence find that 



max 

(ij)ex(c„) 



Y 



, Togn L n 

< \ I , — > a.s. 



nc n yTog n 

Thus, for each e > there exists an index n = no(e) G N such that for all n> n 



P | max 

,(M')eX(e n ) 



> 5} <£. 



In other words, £(Yj) G B e {6) uniformly over X(c n ) with probability not less than 1 — e. 
Consequently, 4>(Y\) = max^ge 0Y\ — if) (9) = ^iY\)Yl — ^(£(Y^)) which in turn implies 
that 



Jty'i, 



- oy\ + m = (e(yj) - e)Y\ - m(YD) - m)- 
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Taylor expansion of if) around 9 gives (recall that ip{9) =0 and if} (6) = 1) 

myi)) - m = - o? + ^mm) - of 

for some 9 G B £ (9). This implies 

J(Xl 9) = (£(Fj) - 9)(Y J l ) - I(eCFj) - 9f - ^~)(£(F^) - £) 3 . 

Again, Taylor expansion of £ = m -1 around shows 

'#{§) 



: 3\2 



for some 9 G B&(9). This finally proves that 

2Ti(Y, 9) = (j - i + 1) J(FJ, 0) = (j - i + l)(Y\f + + l)f n (Yi) 

where /„ is such that |/ n (y^)| < C 2 ■ {Y\Y for a constant C = C(S') > (independent of 
e, i and j) and for all n > n Q . It thus holds with probability not less than 1 — e that 



max 



2T>(Y,e*)-y/j-i + l\Yl 



<C max 

(i,i)ex(c„) 



-C max 

(i,i)ex(c„) 



(j-z + l)(F 

EL* 



1/2 



vO' - * + 1 



(j - z + 1) 



-1/6 



3/2 



<c 



3/2 



' log 3 n 



Vlogn / y nc r , 

From Shao's Theorem it follows that the last term vanishes almost surely asn-f oo. □ 



Combination of Lemma 7.4 and 7.5 yields 



Proposition 7.6. There exist i.i.d standard normally distributed r.v. Z\,...,Z n on the 
same probability space as Y\, . . . ,Y n such that 



max 

(i,i)6X(Cn) 



2T!(Y,8)-y/j-i + l\Zi 



o P (l). 



Lemma 7.7. For n G N, define the continuous functionals h, h n : C([0, 1]) — > R by 

\x(t) — x(s)\ 



h(x, c) 
h n (x,c) 



sup 

0<s<t<l 
t— s>c 



max 



2 log 



t - s 



and 



\x(j/n) — x(i/n)\ 



l<i<j<n \ /(j + 
(j-i+l)/n>c \ V VJ II 



n 



en 



j-i + 1 
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respectively. Moreover assume that {x n } n( - N C C([0, 1]) is such that x n — > x for some 
x G C([0, 1]). Then h n (x n , c) — )■ c). 

Proof. Let 5 > 0. Then there exists an index n £N such that \x n (t) — x(t)\ < 5 for all n > 
n and t G [0, 1]. Thus, it follows directly from the definition that h n (x) = h n (x n ) + 0{5) 
for n > hq. Since u (->■ a/ 2 log e/tt is uniformly continuous on [c, 1] we consequently have 
that h n (x) — > h(x) as n — > oo and the assertion follows. □ 



Before we proceed, recall the definition of M in (15). Moreover, we introduce for < c < 1 
the statistic 

M(c):= sup (!*M^-J^T). (47) 
o<s<t<i V y/t — s V t — sj 

t—s>c 

From [9U Thm. 6.1] (and the subsequent Remark 1) it can be seen that M(c) converges 
weakly to M as c — > + . 

Proposition 7.8. Let c > and define 



TtfYJ) = max ^21^,0) - ./21og^^ 
TTien lim c _ s>0 + lim n -s-oo ^(Y, ^) = ^ weakly. 

Proof. Set So = and = Yj. + . . . + Y n and let {X n (t)} t>0 be the process that is linear 
on the intervals [i/n, (i + with values X n (i/n) = Si/y/n. We obtain fro m Donsker's 
Theorem that X n — > B. Now, recall the definition of h and h n in Lemma 7.7 and observe 
that 



h n (X n ,c)= max ( a/j - % + l|Y- 1 - A /21og 



en 



(i,j)ex(c) V ' V j-i + 1 

It hence follows from Lemma [7.51 that 



\TZ(Y,6)-h n (X n ,c)\< max 

(i,i)ez(c) 



277(7,^-^-^ + 11^1 



o P (l). (4* 



Since X„ — )■ 5, Lemma 



7.7 



and [61 Thm. 5.5] imply that h n (X n ,c) -4 h(B,c). Theorem 
4.1 in [6] and Q thus imply that T°(Y,9) 4 c) = M(c) as n ^ oo for all c> 0. 
Thus, the assertion finally follows, since M(c) — > M weakly as c — >■ + □ 

Theorem 7.9. Lei K = (Y\, . . . ,Y n ) be independent and identically distributed random 
variables with distribution Fg, 9 G O. Moreover, assume that {c n } ngN is a sequence of 
positive numbers such that n' 1 log 3 n/c n — > and set 



T n{ Y A c) = ( _ ma t) (\/2I7(y,») - ^logj^Vl 
TTien, T n (Y, 0, c n ) — )■ M weakly as n — >■ oo. 
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Proof. First observe that according to Proposition |7.6| we have for alH > that 

P(T n (Y,9-c n )<t) = P{ max ( v 'j , + 1 1 Z\ \ J 2 log 

(i,j)ez(cn) 



> P ( sup 

,0<s<t<l 



B(t)-B(s)\ 



en 



2 log 



t - s 



< t I oil 



< t +o(l) 



This shows that for all t > 

liminf P(T n (Y,6,c n ) <t) > P(M <t) 
Now let c > be fixed and assume w.l.o.g. c n < c for all n e N. With as defined in 



Proposition |7.8| we conversely find 

limsupP(T n (y,fl,c n ) <t)< limsupP(T^(r,#,c n ) < t) = P(M(c) < t). 



Hence the assertion follows from Proposition 7.8 after letting c — > + and the fact that 
M > a.s. □ 



Proof of Theorem 2.1. Let T n (Y, $;c n ) be defined as in (14). From Theorem 7.9 it then 
follows that 

\B(t) - B{s) 



v 



T n (Y, $; Cn) — >■ max sup 

0<k<K Tk <s<t<T k+1 



a / 2 log 



y/t — S V " * — S , 

Clearly the limiting statistic on the right hand side is stochastically bounded from above 
by M. Conversely, we observe by the scaling property of the Brownian motion that 

\B(t)-B(s)\ 



sup 

T k <S<t<T k+ l 



V 



sup 

0<s<t<l 



B(t) - B(s) 



2 log 



t - s 



2 log 



t-s 



+ 2 log 



V 

> M 



21og- 



□ 



7.3. A general exponential inequality. In this section we give a general exponential 
inequality for the probability that SMUCE underestimates the number of change-points. 
To this end, we will make use of the functions 

"£ 



K,f(v : w : x : y)= inf sup 
v<e< w ee[0:X] 



IX 



(D(6\\9±x)-y)-D(6\\6±e) 



nf(v,w,x) 



inf D(6±x\\6). 

v<9<w 
0±xd\v ,w\ 



(49) 
(50) 
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Theorem 7.10. Let q G R and K(q) be defined as in (17). Moreover, assume that nf and 
Acf are defined as in (49) and (50), respectively and set 

( 



K\ = mm < kJ 



9,9 



nX 



V 



/ 



9,9 



A (?+ "\/21og 



' ' 2 ' 



V 



and 



/ 



'•2 Him <J i;j { 9, 9, — J , k 2 ( 0, 0, — 



7/ A > 2c n , i/ien 



P (A(g) <K) <2K 



e a - -f e a - 



(51) 



Proof. Let A and A be the smallest jump size and the smallest interval length of the true 
regression function i.e. 



A 



inf \9 h 



Kk<K 



? fe _a and A= inf r fe+1 - r fe . 

0<k<K 



Now define A" disjoint intervals Jj = (t^ — A/2, 7} + A/2) C [0, 1]. Let 0/ = maxj^!,^}, 
4 ~ = min 0j} and split each interval Ij accordingly, i.e. If = {t G 7^ : = 0+} 

and ir = {teli: 0(t) = 0~}. Clearly J^/rU /+. 
From the definition of the estimator A(g) it is clear that 

< A" 3^ G S n [K - 1] such that T n (Y, 0) < q. 

If d G «S n [JC — 1], then there exists an index k G {1, . . . ,K} such that $ is constant on 

<^and J^iYj)- 



U Let n k = 139 G 6 : JT I+ (Y,9) - /lo. 



#'2 



l p- < _2_ 



Since the K intervals are disjoint we find 

P(K(q)<K)< 



A" 



fe=l 



If G <S n [-K" — 1] is constant on some I k with value 9, then either < 9f — A/2 or 
> fc + A/2, by construction. Set 



ft; 



30<0+-A/2: JT, + (Y,0) 



en g 

los #^-7I 



30 > 0^ + A/2 : JT z -(Y, 



en q 



and observe that P(ftfc) < P(ft^ ) + P(ft fc ). We proof an upper bound for P(fi^), the same 
bound can be obtained for P(ft^) analogously. Recall that i-> T 7 -(Y, •) is convex and 
has its minimum at m~ 1 (y j -). Thus, T 7 -(Y, 0) > Tj-(Y, 0^ + A/2) whenever m -1 (Yj-) < 
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^ + A/2. This yields 

p (n k ) < p (a- n \m-\Y,-) < e k + 1 



p m- i (r /r )>^- 



A 



< 1 - P (t 7 - (y, fc + y) > 1/2 (q + v / 2W)) 2 ) 



P [m-\Y IZ )> 



A 



< 



exp 



V 



An . . 
— ml 

2 ee[0,A/2] 



V 



A/2 



D{6 k \\e- k + A/2) 



2e (q+ v/21og(2e/A) 
A An 



An 



+ ^l>(-^£>(0 fc - + A/2||0- 



/ 



< exp 



nA , 

«i 

2 1 



( - a (? + V 21o g( 2e A)) 2, \ 



V 



0, 0, 

-' ' 2 



V 



An 



/ 



\ 



/ 



+ exp 



nA 



A 



by Lemma 7.1 and Lemma 7.2 With the definition ol the constants Kj as in the Theorem 



(J = 1,2) we eventually obtain 

P(K(q) <K)<2K 



exp 



+ exp 



□ 



The constants nf (i = 1,2) basically depend on the exponential family J- '. Their explicit 
computation can be rather tedious and has to be done for each exponential family sepa- 
rately (for the Gaussian case see see below). Therefore, it is useful to have a lower bound 
for these constants. 



Lemma 7.11. Let v be as in (11 ) and nf and be defined as in (49) and (50), respectively. 
Then, 



K,f(v,w,x,y) > 



x 2 mi v<t<w v(t) 2 ± x 2 . 

— y and ft? (v, w.x) > — ml v(t). 

4 snp v<t<w v(t) 2V >~ 2 v<t< w U 



Proof. First observe from (42), that for any 9 G G and e > such that 9 + e G G one has 



D(9\\9 + e) = Jg(9 + s- t)v(t) dt. Thus if follows that for all < e < x 



x 



-D{9\\9 + x) - D{9\\9 + e) 



x 



>+x 



(9 + x- t)v(t) dt 



+ e- t)v(t) dt 



EX £ 

> — inf v(t) — — sup v(t). 
2 te[e,e+x] ' 2 t e[e,e+x] 
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Maximizing over < e < x then yields 

£ ntawa , \ r>fa\\a . W x2 ini te[e,e+x] v(t) 2 
sup -D(0\\0 + x) -D(0\\0 + e) > — y-r. 

This proves that 



+ , \ ^ x mf v<t<w v(t) 2 

Kj[y, w, x, y) > — j-r - y. 

4 sup v<t<w v t 



Likewise, one finds 



x 2 



kJ(v,w,x) > — inf 

Z v<t<w 



The estimates for k 1 and k 2 are derived analogously. □ 



The combination of Theorem 7.10 and the estimates in Lemma 7.11 yield the handy result 
in Theorem 2.2 For the case of Gaussian observations, the constants nf (i = 1, 2) can be 



computed explicitly and in particular K\ is strictly larger than the approximations obtained 
from Lemma 7.11 by setting v(t) = 1. 



Theorem 7.12. Let gel and K(q) be defined as in (17) and assume that T is the family 
of Gaussian distributions with fixed variance 1. Then, 

2 



P (k{q) < K) <2K 



1 / AVXn 



exp 



8 \ 2x/2 



q 




+ exp 



16 



Proof. The proof is similar to the proof of Lemma 7.3 From Lemma 7.11 it follows that 



and one computes explicitly that K,f(v,w,x,y) 



k 2 (v,w,x,y) = 

2y) 2 if x 2 > 2y. The assertion now follows from Theorem 7.10 



2V2 



IT > §(* 



□ 



We close this section with the proof of Theorem 2.8 which is very much in the same spirit 
than the proof of Theorem 7.10| above. 



Proof of Theorem 2.8[ Let again A be the smallest jump of the true signal i9 and recall that 
£ [0; @\ f° r a ll t G [0, 1]. Moreover, define the K disjoint intervals J, = (t*- 



Ti + C n ) C 



7.10 



[0, 1] and accordingly 7~, 07", Of and i?j as in the proof of Theorem 
Now assume that K G N and that $ G >S n [i^] is an estimator of $ such that T n (Y, $) < q 
and 

Tk\ > c n . (52) 



max mm 

0<KK0<k<K 



10 
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By (52) we find an index i G {1, . . . , K} such that $ contains no change-point in the 
interval T. With the very same reasoning as in the proof of Theorem 7.10 we find that 



P [3K G G S n [K) : TJY, ■&) < q and max min \f t - r k \ > c n 

0<l<kO<k<K 



and Tj-XO) < - U+ A/log 



By replacing A/2 in the proof of Theorem 7.10 by c n and applying Lemma 7.11 the assertion 
follows. □ 



7.4. Proof of Theorems Q and I2T71 



Proof of Theorem 2.6. W.l.o.g. we shall assume that 5 n > 0. The main idea of the proof 



is as follows: Let J n = argmax{| J\ : J C [0, 1], J Dl n — 0}. In order to show that (23) 
holds, we construct a sequence #* G O such that 

(53) 
(54) 



sup^, P [TjSX 0) < 1/2 (q + y/2\og(e/\J n \)) J ^ and 

su P ,< e; , p (r In x, o) < 1/2 (q + VWW)) 2 ) -> o. 



Note that the true signal takes the value #0 + S n on J n and 9 on J n and it is not 
restrictive to assume that inf„ g N \J n \ > 0. We construct = 6> + y/ f3 n /n for a sequence 
(AOneN tnat satisfies y//3 n /q n ->■ 00. 



We first consider (53 ). To this end observe that for all t G J„ we have \9* n — fijt) \ ^J\Jjn 



y/Pn \J n \- We further find that 

r I n l 7 l Ari — ? nTh I 1 \/ 2 Me/ 141 A 
r Jn : = VP« K«l - Qn ~ V 2 log(e/ I J n |) = g n 1 

y Qn Qn I 



— > OO. 



Thus, we can apply (46) and find for all 9 > 9, 



P (r Jn X9) < 1/2 (q + v/21og(e/|J„|)) 2 ) < exp (- 



Jn 



->■ 0. 



Now observe that for t G I„ we have |#* — i? n (t)| \f\I n \ n — $ny\In\ n ~~ y/Pn \IJ Thus 



(54) follows from (46) given 



T/ n := 6 n y/\I n \ n - a//3„ \I n \ -q n - V 2 lo g( e / |X|) -> oo. 

It hence remains to construct sequences (p n ) for each case (1) and (2) such that the previous 
condition holds while \ff3^/q n — »■ oo. 
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We assume first that limmfn^oo \I n \ > and define f3 n through the equation \J fl n \I n \ — 
c (s n ^/\I n \ n — q n — a/2 log(e/ \J n \) J for some < c < 1. Then, 

y/P n \I n \ _ f S ny /\I n \n A/21og(e/ \J n \) 

q n \ q n q n 

From the condition in case (1) of the theorem and the fact that \I n \ is bounded away from 
zero for large n, we find that \/j3 n /q n — > oo. Further we find T In = (1 — c)y/ ' (3 n \I n \ — > oo. 
Finally we consider the case when \I n \ — >■ and define (5 n through the equation a/ {3 n \I n \ = 
cs n ^/— log \ I n \ for some < c < 1. From the conditions in case (2) of the theorem and the 
inequality \Jx + 1 — \fx < 1/ (2y/x), which holds for any x > 0, one obtains 

r Jn > (v^ + £„)>/- log |/ n | - -q n - a/2 log (e/ |4|) 

= (V2 + (1 - c)e n ) a/- log \I n \ -q n - v^a/I + log(l/ \I n \) 

> ((1 -c)e n )^/-\og\I n \ - ' 



-2 log \L. 



Tins shows t licit -T/ n y oo for ti su.it ciblc sixicill such thcit sup neN qj (e n A/log(l/ |/ n |)) < 
1 — 2c. Again from the assumptions in the theorem it follows that \/j3 n /q n — > oo. □ 



Proof of Theorem 2.1 . Theorem 7.12 implies P(K(q n ) < K n ) < e Tl ' n + e r2,n with 



2 

= 1 f - g„, - v / 21og(2e/A)') - logK n and T 2 , n = - log JT B . 

8 V 2a/2 / + 16 

It is easy to see, that any condition (1) - (3) implies r 2i „ —> oo. It remains to check that 
r ln — > oo. Under condition (1) we observe that 

r x , n = l / V^KAn q n + a/2 fog(2e/A) \ 2 log if n > ^ 

ql 8\ 2V2q n q n J + q 2 n °°' 

Since q n is bounded away from zero, the assertion follows. Next, we consider conditions 
(2) and (3). To this end, assume that y/n\ n A n > (C + e n )^\og(l/X n ) for some constant 
C > and a sequence e n such that e n A/log(l/A n ) — > oo. We find that 



1 / (C + £„) ,/log f 

ri,„>-| ^ --g n -A/21og(2e/A) ] -log** 



2 



1 / £ ™V log i /. T/L7-4A l + log2 , 



2a/2 V A n V 2a/2 / A/21og(l/A) 
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where we have used the inequality \Jx + 1 — \fx < l/(2y/x). If sup neN K n < oo, then the 
choice C = 4 implies I\ n — >■ oo. Otherwise, we use the estimate K n < 1/A n which results 
in C = 12 as a sufficient condition for r ln — > oo. □ 

7.5. Proof of Lemma 13.11 

Proof. First observe that the definition of $(g) in ([6]) implies that q > T n (Y,$(q)) and 
hence, by identifying $(g) with the pair (V(q),9(q)), we find 



(#(g) + l)<z > (£(<?) + l)T n (Y, 0(g)) > ^ ( ^2T T (Y, 0(g)) - ^2 log(e/ |/|) 

/£P(g) 



>V2 Yl (\I\<P(Y I ))-l(Yj(q)-n^2h^{ 



en) 



> y/2y/ 1{Y ,ti{q)) -l(Y,m~ l (Y)) - nv^logtera) 



The last inequality follows from the fact that <j)(Y i) > Y j9 — ip{9) for all 9 e O and all 
/ G "P(g) for the choice 9 = m _1 (y). Summarizing, we find 

7> + l)g + nv/21og(en)) 2 /2 + l(Y, mT^Y)) >l(Yj{q)). 



Now, let $ = (V,9) be a minimizer of (30). The definition of K(q) in (17) implies that 
D(V,9) = oo if #V < K{q). Thus we have that \V\ > K(q). Assume that there exists 
k > 1 such that j^V = K{q) + k (for fc = nothing is to show). Since (V, 9) is a minimizer 
of (30) and since D > we find 

7 (|7>| - 1) < D{V{q),9(q)) - D(Vj)+i(\V(q)\ - l) 
<D(V(q),9(q))-k 1 + l(\V\-l) 
<(l-fc)Z(y,£(g))+ 7 (|£|-l). 

This is a contradiction for /(■#) being non-negative and hence we conclude that |P| = i^(g) 
and that = (V, 9) solves Q. □ 
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